Build indexes by shard using multiple JVMs

When indexing takes too long, and further tuning does not seem to help, it could be that your server is approaching its physical limitations. To avoid such a condition, we can distribute the index across two or more Search servers, so that the indexing workload is also distributed. We can distribute your index across multiple Java Virtual Machines.


Before starting

In order to perform index sharding with the Search server, first set up your shard environment using the following guidelines.


Task info

To distribute an index, you divide it into partitions called shards. Shards can share the same Search Docker container or be run separately in their own Search containers, depending on the system's performance. If your Search shard containers and the Search master container are not located in the same virtual machine or physical machine, we might need extra distributed file system technology, such as remoteStorage, to help the Search shard containers share an index folder with the master container.

Once all of the index shards are successfully populated, they can be merged in one optimized index, to be used with the storefront for sorting, faceting and filtering.

There are two stages to index sharding with multiple JVMs, as there are with a single JVM. In the first stage, the original data is split into multiple ranges. Each range of data is preprocessed and they are indexed into a shard’s index cores in parallel. In the second stage, all the shards’ index cores are merged into the Search master's corresponding index core. For example, CatalogEntry structured shard indexes are merged into the Search master CatalogEntry structured index, and unstructured shard indexes are merged into the Search master unstructured index core. For multiple JVMs, the standard approach is to build each shard index in a separate Search server container. If system resources permit, we can configure one Search server to build multiple shard indexes.

Some extra steps are needed to map volumes between containers and an outside file system. The Search master container needs this mapping so that it can access all shard index folders.

Once you have set up your shard environment, we can then perform indexing to each shard using the di-parallel-process utility. The following diagram shows the two stages for sharding in multiple search docker containers.


Procedure

We can use the attached sample docker-compose file to set up the environment with sharding in multiple JVMs.

  1. Copy docker-compose.yml to your development environment. Rename it docker-compose-shardingMultiJVMs.yml.

  2. Run the following command to set up the environment:

      docker-compose -f docker-compose-shardingMultiJVMs.yml up -d

    If you have not changed the default settings in the file, this command creates three search shard servers, one search master server, one transaction server, one DB2 container and one utility container. Three search shard servers are mapped to different ports for each container. The ports are http port 3737, and https port 3738. For the search shard servers, model your configuration on the following example:

      shard_a:
          image: search-app:latest
          hostname: search_shard_a
          environment:
            - WORKAREA=/shard_a
            - LICENSE=accept
            - TZ=Asia/Shanghai
          ports:
            - 3747:3737
            - 3748:3738
          volumes:
            - /shard_a
          depends_on:
            db:
                  condition: service_healthy
          healthcheck:
            test: ["CMD", "curl", "-f", "-H", "Authorization: Basic ","http://localhost:3737/search/admin/resources/health/status?type=container"]
            interval: 20s
            timeout: 180s
            	  retries: 5

    Note: Set different hostname, external ports and external folders for different search shard containers. For the search master container:

      master:
          image: search-app:latest
          hostname: search_master
          environment:
            - SOLR_MASTER=true
            - WORKAREA=/search
            - LICENSE=accept
            - TZ=Asia/Shanghai
          ports:
            - 3737:3737
            - 3738:3738
          volumes_from:
            - shard_a:ro
            - shard_b:ro
            - shard_c:ro
          networks:
            default:
              aliases:
                - search
          depends_on:
            db:
                  condition: service_healthy
      healthcheck:
            test: ["CMD", "curl", "-f", "-H", "Authorization: Basic ","http://localhost:3737/search/admin/resources/health/status?type=container"]
            interval: 20s
            timeout: 180s
            	  retries: 5

    Note: External volume mapping for search master node is configured for shard index folder, to allow master node access to all shard indexes.

  3. In the Utility server Docker container, modify /opt/WebSphere/CommerceServer90/properties/parallelprocess/di-parallel-process.properties to match the environment. We can use the following examples as a guide. Configure the hostname and port for different shard servers as below.

      Shard.A.common.index-server-name=shard_a
      Shard.A.common.index-server-port=3738
      …
      Shard.B.common.index-server-name=shard_b
      Shard.B.common.index-server-port=3738

    Shard server name should be the same with the hostname/alias with docker compose file. b). configure shard index core directory as below:

      Shard.A.en_US.unstructured-index-core-dir=/shard_a/index/solr/MC_10001/en_US/Unstructured_A/
      Shard.A.en_US.structured-index-core-dir=/shard_a/index/solr/MC_10001/en_US/CatalogEntry_A/
      …
      Shard.B.en_US.unstructured-index-core-dir=/shard_b/index/solr/MC_10001/en_US/Unstructured_B/
      Shard.B.en_US.structured-index-core-dir=/shard_b/index/solr/MC_10001/en_US/CatalogEntry_B/
      
      

      Note: the directory should be absolute path inside each shard container.

    We can automate the shard configuration process. This is useful if, for example, you expect to create a large number of shards. The auto-sharding process will automatically configure properties such as preprocessing-start-range-value, preprocessing-end-range-value, index-core-name and index-core-dir.

    For information on how to set up auto-sharding, see
    Sharding input properties file.

  4. Change to the /opt/WebSphere/CommerceServer90/bin directory.

  5. Run the following command to do sharding in multiple JVMs.

      ./di-parallel-process.sh /opt/WebSphere/CommerceServer90/properties/parallelprocess/di-parallel-process.properties

    See Running utilities from the Utility server Docker container.


Results

Once the merge operation is complete, the merged index will be online and immediately available for use.