Tips for Portal Search crawls

View some useful tips about crawls that Portal Search performs. For example, crawling can require extended memory and time, depending on the Portal Search environment and configuration.

HTTP crawler does not support JavaScript

The HTTP crawler of the Portal Search Service does not support JavaScript. Therefore, some text in web documents might not be available to users for search, depending on how the text is prepared for presentation in the browser. Specifically, text generated by JavaScript might or might not be available for search.
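
To illustrate why JavaScript-generated text can be missing from search results, the following minimal sketch extracts text from a page without running its scripts. It is not the Portal Search crawler; the class VisibleTextExtractor, the function extract_text, and the sample page are hypothetical and use only the Python standard library.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the static text of a page; scripts are never executed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page: the price text only exists after the script runs.
PAGE = """
<html><body>
  <h1>Product overview</h1>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "Special offer";
  </script>
</body></html>
"""

print(extract_text(PAGE))  # prints: Product overview
```

A browser would show both the heading and "Special offer", but a reader that does not execute scripts sees only the heading, so only the heading could be indexed.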

Crawling a portal site for the first time can result in a message

When we start a crawl on a portal site for the first time, the following message can appear:
EJPJP0009E: Wrong root url for Portal site crawler: https://root_url

We can ignore this message. The crawl runs correctly.
To resolve this problem, edit the content source, select the General Parameters tab, and set the parameter Stop fetching documents after (seconds): to a value of 90 seconds.

Memory required for crawls

Depending on the Portal Search environment, crawling can require large amounts of memory. Therefore, before starting a crawl, make sure that HCL WebSphere Portal has enough free memory. Memory shortage can cause a corrupted search collection and eventually lead to a system freeze.
To resolve this problem, raise the limit on the number of open files by using the ulimit command as root administrator.
Because crawling and indexing consume considerable resources, IBM recommends scheduling crawls for times when user activity is relatively low.
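
Before raising the open-file limit, it can help to check the current values. The following minimal sketch reads the limits with the resource module of the Python standard library, which is available on UNIX-like systems only. It reports the limits of the process that runs it, so the assumption is that we run it as the same user and from the same environment that starts the portal server. The threshold ASSUMED_MINIMUM is a hypothetical value for illustration, not a documented requirement.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def is_below(soft, minimum):
    """True if the soft limit is finite and lower than the given minimum."""
    return soft != resource.RLIM_INFINITY and soft < minimum


soft, hard = open_file_limits()
print("soft limit:", describe(soft))
print("hard limit:", describe(hard))

# ASSUMED_MINIMUM is a hypothetical threshold chosen for this example only.
ASSUMED_MINIMUM = 4096
if is_below(soft, ASSUMED_MINIMUM):
    print("soft limit is below", ASSUMED_MINIMUM, "and might need to be raised with ulimit")
```

The soft limit is the value currently enforced; the hard limit is the ceiling up to which the soft limit can be raised without root privileges.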

Time required for crawls and imports and availability of documents

The following search administration tasks can require extended periods of time:

Crawling a content source. During the crawl, documents might not be immediately available for searching or browsing.

Indexing the documents fetched by a crawl. When a crawl has been completed and all documents have been collected, building the index takes some more time.

Importing a search collection. When we import data to a collection, it can take some time until the content sources for the collection are shown in the Content Sources in Collection box and the documents of the imported collection are available for crawling.

These tasks are put in a queue. It might therefore take several minutes until they are executed and the respective time counters start, for example, the crawl Run time and the timeout for the crawl that is set by the Stop collecting after (minutes): option. The time required for these tasks is further influenced by the following factors:

The number of documents in the content source being crawled

The size of the documents in the content source being crawled

Speed and availability of the processors, hard drive storage systems, and network connection.

The value selected from the Stop collecting after (minutes): drop-down menu when creating or editing the content source.

Therefore, both the time limits we can specify and the times shown for these processes are approximate. This applies, for example, to the following scenarios:

When we start a crawl by selecting a content source in the Content Sources in Collection box and clicking Start collecting.

When we import a search collection and when we start a crawl on the imported search collection.

When an installation is complete and we initialize the pre-configured portal site collection by selecting the portal site content source and clicking Start collecting.

The time shown under Last update completed in the collection status information is later than we might assume by just adding the crawler time limit specified by Stop collecting after (minutes): to the crawling start time. This delay is caused by the additional time required to build the index.

Furthermore, this influences other status indicators given in the Manage Search portlet. For example, the number of documents shown for a content source can be unexpectedly low or even zero (0) until the crawl on that content source has been completed.
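
The difference between the assumed and the actual Last update completed time can be sketched as simple date arithmetic. The function names, the queue wait, and the index build duration below are hypothetical values for illustration; Portal Search does not report them in this form, and a crawl can also finish before its time limit is reached.

```python
from datetime import datetime, timedelta


def naive_estimate(crawl_start, stop_after_minutes):
    """Start time plus the Stop collecting after (minutes): limit only."""
    return crawl_start + timedelta(minutes=stop_after_minutes)


def adjusted_estimate(crawl_start, stop_after_minutes, queue_wait, index_build):
    """Also accounts for time spent in the queue and for building the index."""
    return naive_estimate(crawl_start, stop_after_minutes) + queue_wait + index_build


# Hypothetical example: crawl requested at 22:00 with a 60-minute limit.
requested = datetime(2024, 1, 15, 22, 0)
naive = naive_estimate(requested, 60)
adjusted = adjusted_estimate(
    requested,
    60,
    queue_wait=timedelta(minutes=5),    # assumed time waiting in the queue
    index_build=timedelta(minutes=20),  # assumed time to build the index
)

print("naive estimate:   ", naive)      # 2024-01-15 23:00:00
print("adjusted estimate:", adjusted)   # 2024-01-15 23:25:00
```

With these assumed values, the status would show a completion time 25 minutes later than the naive estimate suggests.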
