The indexing process
Stages
Crawling
Process of reading content from each application in order to create entries for indexing. The Search application requests a seedlist from each Connections application. Each application generates its seedlist by running queries on the data stored in its database, based on the parameters that the Search application submits in its HTTP request. The contents of the seedlists are persisted to disk and are deleted when the next incremental indexing task completes successfully.
File content extraction
Search provides a document conversion service to extract the content of the files to be indexed. During the file content extraction stage, the document conversion service downloads files to a temporary folder in the index directory, converts them to plain text, and stores the text in the folder defined by the WebSphere Application Server variable EXTRACTED_FILE_STORE. The extracted text is then indexed. Connections supports the indexing of file attachment content from the Files and Wikis applications, and of IBM FileNet documents.
File content extraction takes place on the schedule defined for the file content extraction task, which runs every 20 minutes by default. File content is not searchable until the file content conversion is complete and the next indexing task has also completed.
Indexing
During the indexing phase, the entries in the persisted seedlists are processed into Lucene documents, which are serialized into a database table that acts as an index cache. When the indexing phase is complete, the seedlists are removed from disk.
A resume token marks where the last seedlist request finished so that the Search application can start from this point on the next seedlist request. The resume token enables Search to retrieve only the data that was added after the last seedlists were generated and crawled.
In incremental foreground indexing, the crawling and indexing stages for multiple applications take place concurrently. For example, if an indexing task that indexes Files, Activities, and Blogs is created, each of these applications is crawled and added to the database cache at the same time. During initial and background indexing, only the crawling stage for multiple applications takes place concurrently.
During incremental foreground indexing, after the crawling and indexing stages are complete, all the nodes are notified that they can build their index. At this point, the index builder on each node begins extracting entries from the database cache and storing them in the index on the local file system.
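The role of the resume token in the Indexing stage can be illustrated with a minimal sketch. It is not the Connections implementation: the Entry record, the fetch_seedlist function, and the use of an integer sequence number as the token are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """Hypothetical seedlist entry; `seq` stands in for an update timestamp."""
    seq: int
    title: str


def fetch_seedlist(
    source: List[Entry], resume_token: Optional[int]
) -> Tuple[List[Entry], Optional[int]]:
    """Return the entries added after `resume_token`, plus the token for the next request."""
    new_entries = sorted(
        (e for e in source if resume_token is None or e.seq > resume_token),
        key=lambda e: e.seq,
    )
    # If nothing new was found, keep the previous token unchanged.
    next_token = new_entries[-1].seq if new_entries else resume_token
    return new_entries, next_token


# Hypothetical application data.
data = [Entry(1, "Wiki page A"), Entry(2, "Blog post B")]

# First crawl: there is no token yet, so every entry is returned.
first, token = fetch_seedlist(data, None)

# New content arrives; the next crawl returns only what follows the token.
data.append(Entry(3, "File C"))
second, token = fetch_seedlist(data, token)
```

In this sketch the second request returns only the entry added after the first crawl, which is the behavior the resume token is meant to provide.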
Index building
Deserialization and writing of the Lucene documents into the Search index. This process occurs only during incremental foreground indexing. During index building, the index builder takes entries from the database cache and stores them in an index on the local file system. Each node has its own index builder, so crawling and preparing entries takes place only once in a network deployment, and the index is then created on each node from the information that has already been processed.
During initial and background indexing, the indexing stage and the index building stage are merged, and no database serialization or deserialization occurs.
Post-processing
After index building (for incremental foreground indexing) or indexing (for initial or background indexing), post-processing work takes place on the new index entries to add metadata to the search results. This work includes bookmark rollup and the addition of file content to Files search results.
Bookmark rollup is the process of aggregating the information for public bookmarks that point to the same URL. For example, if 1000 users create a public bookmark for the same URL, a search for that URL returns a single bookmark instead of 1000 search results. The bookmark returned includes the information for all 1000 bookmarks rolled up into a single search result, so that all of the tags and people associated with the individual bookmarks are now associated with the one document.
In addition, if two users bookmark the same internal document, for example, a wiki page, the wiki page is rolled up with the bookmarks. If a user then searches for the wiki page or for the bookmark they created to it, only one result is returned. The tags and people associated with the bookmarks and the wiki page are combined into a single document.
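Bookmark rollup can be pictured as grouping bookmarks by URL and merging their tags and people. The following is a minimal sketch under stated assumptions, not the Connections implementation: the Bookmark and RolledUpResult classes and the roll_up function are hypothetical, URLs are matched by exact string comparison, and the rollup of internal documents such as wiki pages is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Bookmark:
    """Hypothetical public bookmark."""
    url: str
    owner: str
    tags: Set[str]


@dataclass
class RolledUpResult:
    """Hypothetical single search result that aggregates bookmarks for one URL."""
    url: str
    people: Set[str] = field(default_factory=set)
    tags: Set[str] = field(default_factory=set)


def roll_up(bookmarks: Iterable[Bookmark]) -> List[RolledUpResult]:
    """Merge all bookmarks that point to the same URL into one result."""
    results: Dict[str, RolledUpResult] = {}
    for bookmark in bookmarks:
        result = results.setdefault(bookmark.url, RolledUpResult(url=bookmark.url))
        result.people.add(bookmark.owner)
        result.tags |= bookmark.tags
    return list(results.values())


bookmarks = [
    Bookmark("https://example.com/doc", "alice", {"search", "lucene"}),
    Bookmark("https://example.com/doc", "bob", {"search", "index"}),
    Bookmark("https://example.com/other", "carol", {"wiki"}),
]
results = roll_up(bookmarks)
```

Here the two bookmarks for the same URL collapse into one result that carries both owners and the union of their tags, while the third bookmark remains a separate result.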
Indexing types
The following table explains the differences between the various types of indexing:
|  | Foreground indexing | Background indexing |
| --- | --- | --- |
| Initial indexing | The initial index is built using the default 15min-search-indexing-task. Alternatively, it can be built by a custom indexing task created by the SearchService.addIndexingTask command or a command that is run once, such as SearchService.indexNow(String applicationNames). This index is used for searching and for further indexing. The database cache is not used. | An index is built using SearchService.startBackgroundIndex. The background indexing command creates a one-off index in a specified location on disk. This index is not used for searching. The database cache is not used. |
| Incremental indexing | The index is updated using the default 15min-search-indexing-task. Alternatively, the index can be updated by a custom indexing task created by the SearchService.addIndexingTask command or a command that is run once, such as SearchService.indexNow. This index is used for searching and for further indexing. The database cache is used. | A background index can be updated using SearchService.startBackgroundIndex. This index is not used for searching. The database cache is not used. |
Indexing steps
The indexing process involves the following steps:
Initial and background indexing
- Crawl all pages of the seedlist and persist them to disk.
- Extract the file content and persist it to disk.
- Crawl a seedlist page from disk.
- Index the seedlist entries into Lucene documents.
- Write the documents to the Lucene index.
- Repeat the previous three steps until all the persisted seedlist pages have been crawled.
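The initial and background indexing steps above can be sketched as a simple loop that writes documents straight to the index, with no database cache in between. This is an illustration only: the seedlist pages, the JSON file layout, and the persist_pages, to_document, and build_index functions are assumptions, a plain list stands in for the Lucene index, and the file content extraction step is omitted.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical seedlist pages, standing in for the responses from each application.
SEEDLIST_PAGES = [
    [{"id": "f1", "title": "Budget.xlsx"}, {"id": "f2", "title": "Plan.docx"}],
    [{"id": "b1", "title": "Release notes"}],
]


def persist_pages(pages: List[List[Dict[str, str]]], directory: Path) -> List[Path]:
    """Crawl step: write every seedlist page to disk before indexing starts."""
    paths = []
    for number, page in enumerate(pages):
        path = directory / f"page-{number:04d}.json"
        path.write_text(json.dumps(page))
        paths.append(path)
    return paths


def to_document(entry: Dict[str, str]) -> Dict[str, str]:
    """Indexing step: turn a seedlist entry into a document (a plain dict here)."""
    return {"id": entry["id"], "title": entry["title"].lower()}


def build_index(page_paths: List[Path]) -> List[Dict[str, str]]:
    """Read each persisted page and write its documents directly to the index."""
    index: List[Dict[str, str]] = []
    for path in page_paths:  # repeat until all persisted pages are processed
        entries = json.loads(path.read_text())  # read one page back from disk
        index.extend(to_document(entry) for entry in entries)
    return index


with tempfile.TemporaryDirectory() as tmp:
    index = build_index(persist_pages(SEEDLIST_PAGES, Path(tmp)))
```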
Incremental foreground indexing
- The node that has the scheduler lease crawls all the pages of the seedlist and persists them to disk.
- Crawl a seedlist page from disk.
- Index the seedlist entries into Lucene documents.
- Serialize the Lucene documents into the database cache.
- Send a JMS message to all Search nodes to alert them of the completion of the serialization.
- Each node deserializes the Lucene documents into the Lucene index.
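The incremental foreground steps above differ from initial indexing in that documents pass through a shared database cache and every node builds its own index from it. The sketch below shows that shape under stated assumptions: an in-memory SQLite table stands in for the database cache, JSON stands in for Lucene document serialization, a direct method call stands in for the JMS message, and the SearchNode class and crawl_and_cache function are hypothetical.

```python
import json
import sqlite3
from typing import Dict, List


class SearchNode:
    """Hypothetical Search node that keeps its own local index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local_index: Dict[str, dict] = {}

    def on_cache_ready(self, cache: sqlite3.Connection) -> None:
        """Index building: deserialize cached documents into the local index."""
        rows = cache.execute("SELECT doc_id, payload FROM index_cache")
        for doc_id, payload in rows:
            self.local_index[doc_id] = json.loads(payload)


def crawl_and_cache(entries: List[dict], cache: sqlite3.Connection) -> None:
    """Runs once, on the node that holds the scheduler lease."""
    for entry in entries:
        document = {"id": entry["id"], "title": entry["title"]}
        cache.execute(
            "INSERT OR REPLACE INTO index_cache (doc_id, payload) VALUES (?, ?)",
            (document["id"], json.dumps(document)),
        )
    cache.commit()


cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE index_cache (doc_id TEXT PRIMARY KEY, payload TEXT)")

nodes = [SearchNode("node1"), SearchNode("node2")]
crawl_and_cache(
    [{"id": "a1", "title": "Activity"}, {"id": "w1", "title": "Wiki page"}],
    cache,
)

# Stand-in for the JMS message that tells every node serialization is complete.
for node in nodes:
    node.on_cache_ready(cache)
```

Crawling and document preparation happen once, and each node ends up with the same local index built from the cached documents.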
See
- Search index directory structure
- Add a service to the search index
- Check fault tolerance during initial indexing
- Manipulating the resume tokens for Connections services
Related:
- Configure page persistence settings
- Scheduling tasks
- Configure the number of indexing threads
- Recreate the search index
- Change the location of the Search index