Managing the content sources of a search collection
This topic describes how you manage the content sources of a search
collection.
To work with the content sources of a collection, select . Then select a search collection by clicking the collection name link. Portal Search displays the Content Sources panel. It shows the status of the selected search collection, lists its content sources together with their status, shows information about the individual content sources, and lets you perform tasks on these content sources.
You can select the following option icons and perform the
following tasks in relation to the search collection which you selected from
the Search Collections list:
- New Content Source. Use this option to create a
new content source for the search collection that you selected from the Search
Collections list. For detailed instructions refer to the portlet help. Notes:
- You can configure a search collection to cover multiple content sources
of different types. For example, you can combine portal sites, Web sites,
and local document collections.
- The selectable options and data entry fields that are displayed under
the different configuration tabs depend on which type of content source you
select.
- If you select the radio button for Portal site, the appropriate
data for your portal site is already filled in.
- For some content sources you might need to enter sensitive data, such
as a user ID and password. For example, this applies to secured WebSphere Portal Express sites
or HTTP sites that require a user ID and password. To ensure that this sensitive
data is encrypted when it is stored, update and run the file searchsecret.xml using
the XML configuration interface before you create the content source; a command
sketch follows after this list.
For details about how to do this, refer to Encrypting sensitive data.
- Under the General parameters tab you must set
the URL for the content source in the field Collect documents linked from
this URL: . The crawler needs this URL as its starting point. Note: A crawler failure
can be caused by URL redirection problems. If this occurs, try editing
this field, for example, by changing the URL to the URL that the original
address redirects to. A way to check for such redirects is sketched at the
end of this topic.
- Under the General parameters tab you can set a
timeout for crawling a content source under the option Stop collecting
after (minutes): . This timeout works as follows:
- The timeout is a fuzzy time limit; it might be exceeded by some percentage.
- The crawl action is put in a queue, so it might take several minutes
until the crawl is executed and the time counter starts. As a result, the
crawl can appear to take longer than the timeout that you set.
When you start the crawl by clicking Start Crawler,
allow for some time tolerance and be aware of the Time required for crawls and imports and availability of documents.
- Under the Advanced Parameters tab, the entry field
for the Default Character Encoding contains the initial default value
windows-1252, regardless of the setting for
the Default Portal Language under . Enter the default character encoding that is required
for your portal language; otherwise documents might be displayed incorrectly under
Browse Documents. A quick way to check whether an encoding covers your content
is sketched at the end of this topic.
- Before you start the crawl, set the preferred language of the crawler
user ID to match the language of the search collection that it crawls.
- You start the initial crawl on a newly created content source in either
of the following ways:
- After you create the content source, click the Start Crawler icon.
This starts an immediate crawl.
- When you create the content source, define a schedule under the Schedulers tab.
The crawl starts at the next time that you specified.
- Refresh. Use this option to update the list of
content sources and the status shown for this collection.
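The following sketch shows how the updated searchsecret.xml file might be run through the XML configuration interface (the xmlaccess command) before you create a content source that needs credentials. The installation path, host name, port, and administrator credentials are placeholders; substitute the values for your environment.
```
# Run the updated searchsecret.xml through the XML configuration interface.
# Path, host name, port, and credentials are placeholders for your environment.
cd /opt/IBM/WebSphere/PortalServer/bin
./xmlaccess.sh -in searchsecret.xml \
               -user wpsadmin -password wpsadminpassword \
               -url http://your-portal-host:10039/wps/config
```
On Windows systems, use xmlaccess.bat instead of xmlaccess.sh.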
- Select the following option icons and perform the following tasks on a
content source:
- View Content Source Schedulers. Use this option
to view and manage schedulers. This option is only available if you have defined
schedulers for the content source.
- Start Crawler. Use this option to start crawling a content source, that
is, to start collecting its documents. You can also use it to start an update
of a content source by a new run of the crawler, or to stop such an update.
The timeout that you set under the General Parameters tab
for crawling a content source works as a fuzzy time limit and might be exceeded
by some percentage, so allow some tolerance.
- Verify Address of Content Source. Use this option
to verify that the URL of the content source is still live and available.
Manage Search returns a message about the status of the content source.
- Edit Content Source. Use this option to make changes
to the content source, that is, to configure parameters, schedules, categories,
and filters for the selected content source. Note: If you modify a content
source that belongs to a search scope, update the scope manually to make sure
that the scope still covers that content source. In particular, if you changed
the name of the content source, edit the scope and make sure that the content
source is still listed there. If it is not, add it again.
- Delete Content Source. Use this option to delete
the content source. Note: If you delete a content source, the documents
that were collected from it remain available for search under all scopes
that included the content source before it was deleted. These documents
remain available until their expiration time ends. You specify this expiration
time under Links expire after (days): under General Parameters when
you create the content source.
- View information about the status and configuration of the content
source.Note: To update the status information, click the Refresh button
or the refresh button of the browser.
- View the Collection Status information of
the selected search collection. The status fields show the following data
that changes over the lifetime of the search collection:
- Search Collection Name:
- Shows the name of the selected search collection.
- Search Collection Location:
- Shows the location of the selected search collection in the file system.
This is the full path where all data and related information of the search
collection is stored.
- Collection Description:
- Shows the description of the selected search collection if available.
- Search Collection Language:
- Shows the language for which the search collection and its index are optimized.
The index uses this language to analyze the documents when indexing, if no
other language is specified for the document. This feature enhances the quality
of search results for users, as it allows them to use spelling variants, including
plurals and inflections, for the search keyword. For more information refer
to Language support for Portal Search.
- Categorizer used:
- Shows the categorizer that is used by the search collection. For more
information about categorizers refer to Categorizers and taxonomies and
the related subtopics. For more information about how to work with a rule-based
categorizer for a search collection, refer to User-defined rule-based categorizer and
to the Manage Search portlet help.
- Summarizer used:
- Shows whether a static summarizer is enabled for this search collection.
For information about the summarizer refer to Summarizer.
- Remove common words from queries:
- Shows whether the indexer and the search filter out common words from
documents, such as and, the, of.
- Last update completed:
- Shows the date when a content source defined for the search collection
was last updated by a scheduled update.
- Next update scheduled:
- Shows the date when the next update of a content source defined for the
search collection is scheduled.
- Number of active documents:
- Shows the number of active documents in the search collection, that is,
all documents that are available for search by users.
Notes:
- To update the status information, click Refresh.
Clicking the refresh button of the browser will not update the status information.
- If you delete a portlet from the portal after a crawl of the portal site,
the deleted portlet is no longer listed in the search results. However, refreshing
the view does not update the status information about the Number of active
documents. This information is not updated until after the next cleanup
run of portal resources. For details
about the cleanup service refer to Delayed cleanup of deleted portal pages.
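As noted for the Default Character Encoding field under the Advanced Parameters tab, the preset value windows-1252 does not cover every portal language. A minimal way to check whether sample text from your content survives a candidate encoding, assuming a UTF-8 shell and GNU iconv, is:
```
# iconv exits with a non-zero status if the text contains characters
# that the target encoding cannot represent.
echo "Grüße aus Köln" | iconv -f UTF-8 -t WINDOWS-1252 > /dev/null \
  && echo "windows-1252 can represent this text"
echo "日本語のテキスト" | iconv -f UTF-8 -t WINDOWS-1252 > /dev/null 2>&1 \
  || echo "choose a Unicode encoding such as UTF-8 instead"
```
The sample strings are illustrations only; test with text that is typical for your own content sources.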
For more details about the available options for content sources, refer
to the Manage Search portlet help. Note: When you start
a crawl, be aware of the Memory required for crawls and
the Time required for crawls and imports and availability of documents.
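If a crawl fails and you suspect a URL redirection problem, as described in the note on the Collect documents linked from this URL: field, you can check from a command line whether the content source URL answers with a redirect and which target URL to enter instead. The URL below is a placeholder, and the sketch assumes that curl is available and that the server accepts HEAD requests:
```
# Request only the response headers; a 301 or 302 status together with a
# Location header shows the redirected URL to enter for the content source.
curl -s -I "http://your-content-source-host/start-page" | grep -i -E "^(HTTP|Location)"
```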
- Applying filter rules
Portal Search provides a facility for applying filter rules to the crawler process. The crawler filters control the crawler progress and the type of documents that are indexed and cataloged.
Parent topic: Set up search collections
Related tasks
Exporting and importing search collections
Related reference
Creating and configuring search collections