
Manage the content sources of a search collection

Search collections consist of one or more content sources. To work with content sources of a collection, click...

    Administration | Search Administration | Manage Search | Search Collections | search_collection

Portal Search displays the Content Sources panel, which shows the status of the selected search collection, and lists its content sources and their status.


New Content Source

Create a new content source for the search collection selected from the Search Collections list.

  1. The search crawler supports only basic authentication; therefore, Unicode characters in the user ID and password are not supported.

  2. We can configure a search collection to cover multiple content sources of different types. For example, we can combine portal sites, websites, and local document collections.

  3. The selectable options and data entry fields displayed under the different configuration tabs depend on which type of content source we select.

  4. If we select Portal site, the appropriate data for the portal site is already filled in.

  5. If we select WCM site (Web Content Manager site), we need to enter the appropriate data. For information about how to construct the URL for the content source, refer to Seedlist 1.0 REST service API in the WCM documentation.

  6. For some content sources, we might need to enter sensitive data, such as a user ID and password. For example, this action applies to secured WebSphere Portal sites or HTTP sites that require a user ID and password. To ensure encryption of this sensitive data when it is stored, update and run the file searchsecret.xml by using xmlaccess.sh before creating the content source.
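
    For example, a minimal sketch of the xmlaccess.sh call, assuming the default portal configuration URL on port 10039 and the administrator user ID wpsadmin; adjust the host, port, credentials, and the location of searchsecret.xml to match the installation:

      ./xmlaccess.sh -in searchsecret.xml -user wpsadmin -password admin_password -url http://localhost:10039/wps/config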

  7. If we are using form-based authentication, specify the following fields (see the example after this list):

    Each host name within a crawler can have a single form-based security definition, a single basic authentication definition, or multiple basic authentication realm definitions.

        User name and Password
        The user name and password associated with the login form.

        URL of login FORM
        Specify the submit URL value of the login form for the site to be crawled. The crawler issues a POST request to this URL and passes the user name and password to it through the Ajax proxy.

        User FORM field name
        Specify the name of the field in the login form in which the user name is entered.

        Password FORM field name
        Specify the name of the field in the login form in which the password is entered.
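
    For example, if the login form of the crawled site submits to https://www.example.com/login and uses input fields named j_username and j_password (the host name and field names here are purely illustrative), the settings would be:

        URL of login FORM: https://www.example.com/login
        User FORM field name: j_username
        Password FORM field name: j_password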

  8. When creating a portal site content source in a portal cluster environment configured with SSL, provide the cell security information for the web server and the nodes. For example, in a cluster with the cluster URL...

      https://web_server/wps/portal

    ...the primary node URL...

      http://node_1:10039/wps/portal

    ...and the secondary node URL...

      http://node_2:10050/wps/portal

    ...provide the user ID and password for the web server and both nodes 1 and 2.

  9. Under the General parameters tab, set the URL for the content source in field...

      Collect documents linked from this URL

    The crawler needs this URL for crawling. For information about how to construct the URL for the content source, refer to Seedlist 1.0 REST service API in the WCM documentation (see the sketch at the end of this step).

    A crawler failure can be caused by URL redirection problems. If this occurs, try changing the URL to the redirected URL.
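
    As an illustrative sketch only, a WCM seedlist URL generally follows this pattern; verify the exact syntax and parameters against the Seedlist 1.0 REST service API documentation:

      http://host:port/seedlist/myserver?SeedlistId=&Source=com.ibm.workplace.wcm.plugins.seedlist.retriever.WCMRetrieverFactory&Action=GetDocuments

    Here host and port stand for the WCM server, and SeedlistId can name a specific library.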

  10. For crawling a website content source, we can set a timeout under the General parameters tab under the option...

      Stop collecting after (minutes)

    This timeout works as follows:

    1. The timeout works only for website content sources.

    2. The timeout works as an approximate time limit. It might be exceeded by some percentage.

    3. The crawl action is put in a queue, so it might take several minutes until it is run and the time counter starts. The crawl can therefore appear to take longer than the set timeout.

    When we start the crawl by clicking Start Crawler, allow for some tolerance in the time required for crawls and imports and in the availability of documents.
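
    For example, with the timeout set to 30 minutes (an illustrative value), a crawl that waits in the queue for 10 minutes and then overruns the limit by some percentage might not finish until 45 minutes or more after Start Crawler is clicked.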

  11. Under the Advanced Parameter tab, the entry field for the Default Character Encoding contains the initial default value windows-1252, regardless of the setting for the Default Portal Language under...

      Administration | Portal Settings | Global Settings

    Enter the required default character encoding, depending on the portal language. Otherwise, documents might be displayed incorrectly under Browse Documents.
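
    For example, windows-1252 suits Western European languages; for a portal whose default language is Japanese, an encoding such as UTF-8 would be a more appropriate default. This choice is an illustration; enter the encoding that matches the actual portal content.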

  12. Before starting the crawl, set the preferred language of the crawler user ID to match the language of the search collection that it crawls.

  13. Start the initial crawl on a newly created content source by either of the following options:

    • After creating a new content source, click the Start Crawler icon. This starts an immediate crawl.

    • When creating the content source, define a schedule under the Schedulers tab. The crawl starts at the next possible time specified.

Use Refresh to update the list of content sources and the status shown for this collection.

Options for content source...

  • View Content Source Schedulers.

    View and manage schedulers.

    This option is only available if we defined schedulers for the content source.

  • Start Crawler.

    Click this icon to start a crawl on the content source. This action updates the contents of the content source by a new run of the crawler. While a crawl on the content source is running, the icon changes to Stop Crawler. Click this icon to stop the crawl. Portal Search refreshes different content sources as follows (see the summary after this options list):

    • For website content sources, documents that were indexed before and still exist in the content source are updated. Documents that were indexed before but no longer exist in the content source are retained in the search collection. Documents that are new in the content source are indexed and added to the collection.

    • For WebSphere Portal sites, the crawl adds all pages and portlets of the portal to the content source. It deletes portlets and static pages from the content source that were removed from the portal. The crawl works similarly to the option...

        Regather documents from Content Source

    • For Web Content Manager sites, Portal Search uses an incremental crawling method. In addition to added and updated content, the Seedlist explicitly specifies deleted content. In contrast, clicking...

        Regather documents from Content Source

      ...starts a full crawl; it does not continue from the last session, and it is therefore not incremental.

    • For content sources created with the seedlist provider option, a crawl on a remote system that supports incremental crawling, such as IBM Connections, behaves like a crawl on a Web Content Manager site.

  • Regather documents from Content Source.

    This option deletes all existing documents in the content source from previous crawls and then starts a full crawl on the content source. Documents that were indexed before and still exist in the content source are updated. Documents that were indexed before but no longer exist in the content source are removed from the collection. Documents that are new in the content source are indexed and added to the collection.

  • Verify Address of Content Source.

    Click this icon to verify that the URL of the content source is still live and available. Manage Search returns a message about the status of the content source.

  • Edit Content Source.

    Click this icon to make changes to a content source. The changes include configuring parameters, schedules, and filters for the selected content source.

    • It is beneficial to define a dedicated crawler user ID.

      The pre-configured default portal site search uses the default administrator user ID wpsadmin with the default password of that user ID for the crawler. If we changed the default administrator user ID during the portal installation, the crawler uses that user ID. If we changed the user ID or password for the administrative user ID and still want to use that user ID for the Portal Search crawler, we need to adapt the settings.

      To define a crawler user ID, select the Security tab, and update the user ID and password. Click Save to save the updates.

    • If we modify a content source that belongs to a search scope, update the scope manually to verify that it still covers that content source.

      Especially if we changed the name of the content source, edit the scope and make sure that the content source is still listed there. If not, add it again.
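
To summarize the refresh behavior of Start Crawler compared to Regather documents from Content Source, as described above:

  • Website: Start Crawler updates existing documents and adds new ones but retains deleted documents; Regather documents from Content Source also removes deleted documents.

  • Portal site: Start Crawler adds new pages and portlets and removes deleted ones, similar to Regather documents from Content Source.

  • WCM site or seedlist provider: Start Crawler is incremental and also removes deleted content; Regather documents from Content Source runs a full, non-incremental crawl.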


Delete Content Source

Click this icon to delete the selected content source.

If we delete a content source, the documents that were collected from it remain available for search by users under all scopes that included the content source before it was deleted. These documents remain available until their expiration time ends. We can specify this expiration time in the Links expire after (days): field under General Parameters when creating the content source.
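
For example, if Links expire after (days): is set to 30 (an illustrative value), documents collected from a deleted content source can continue to appear in search results for up to 30 days after the deletion.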


Status Information

View the Collection Status information of the selected search collection.

To update the status information, click the Refresh button. The refresh button of the browser does not update the status information.

The status fields show the following data that changes over the lifetime of the search collection:

Search Collection Name: Name of the selected search collection.
Search Collection Location: Location of the selected search collection in the file system. The full path where all data and related information of the search collection is stored.
Collection Description: Description of the selected search collection if available.
Search Collection Language: Language for which the search collection and its index are optimized. The index uses this language to analyze the documents when indexing, if no other language is specified for the document. This feature enhances the quality of search results for users, as it allows them to use spelling variants, including plurals and inflections, for the search keyword.
Summarizer used: Whether a static summarizer is enabled for this search collection. The static summarizer creates a summary of the page, which is based on the page's full content. The page's full content can include metadata, HTML elements, and Web Content Manager templates. These additional elements might be interpreted as text and thus become a part of the page's summary. Do not use the static summarizer if the page's summary contains a large amount of noise from these additional elements.
Last update completed: Date when a content source defined for the search collection was last updated by a scheduled update.
Next update scheduled: Date when the next update of a content source defined for the search collection is scheduled.
Number of active documents: Number of active documents in the search collection, that is, all documents available for search by users.


  • If we delete a portlet from the portal after a crawl of the portal site, the deleted portlet is no longer listed in the search results.

    Refreshing the view does not update the status information about the Number of active documents. This information is not updated until after the next cleanup run of portal resources.


Parent topic: Set up search collections

Related concepts:
Apply filter rules
Delayed cleanup of deleted portal pages
Tips for using Portal Search
The portal site search collection fails
Web Content Manager - Seedlist 1.0 REST service API