Search and crawl
Configure the local portal site, and crawl remote portal sites, so that they are searchable by users. Run crawlers against other, external Web sites to make them searchable by local portal users. Users of the portal can search across various types of sites. In addition to searching the local portal site, we can crawl remote portal sites and external Web sites to make search results from those sites available to the local portal users. Examples of search scenarios include:
- Users of the portal search our own local portal site. This can include public and secure pages of the portal.
- Users of the portal search the Web Content Manager collection provided with the portal. This includes all Web Content Manager sites and libraries.
- Users of the portal site search other portal sites. This works only for public pages of the other portals.
- Users of the portal search external Web sites such as yahoo.com, google.com, or cnn.com. When we run a crawler against external Web sites, we can collect and display external search results next to results from the local portal site.
- External users search the portal site. This works only for public pages of the portal.
Search the local portal
View information on setting up the local portal for your users to search.
The portal default search collection combines two content sources and their related crawlers:
- The Portal Content Source. This contains the local portal site, where users can search for portal pages and portlets.
- The Web Content Manager (WCM) Content Source, which users can search for web content.
Reset the default search collection
Under certain circumstances we might want to change the configuration of the portal site search collection. In this case we need to recreate the collection, as search collections cannot be modified.
The portal site default search collection is created the first time an administrator navigates to the search administration portlet Manage Search. This affects the configuration tasks for the portal and Portal Search and the sequence in which we perform them. An example scenario is a portal database transfer, for example, from the default database to a different database. In this case create the portal site collection by navigating to the Manage Search portlet before transferring the database. Otherwise the portal site collection will not be available after the database transfer.
If the portal site collection was created by navigating to the Manage Search portlet before the portal and Portal Search were completely configured, we might need to recreate the search collection. Example scenarios are as follows:
- If the preferred language for the portal site crawler user ID does not match the language of the portal site search collection.
- If you decide to change the default directory location for search collections in the portal installation.
- If the file path length for search collections exceeds its limit of 118 characters, the collection cannot be created. In this case specify a shorter value for the parameter DefaultCollectionsDirectory. For details about how to configure this parameter refer to Configure the default location for search collections. This file path length problem can occur particularly when the portal site collection is created under UNIX operating systems. For details about this refer to Create the portal site search collection fails.
- If you do not want summary information to be generated for the portal and web content and you therefore want to turn the summarizer off.
- If you want to change the name of the search collection.
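The path length limit mentioned above can be checked before choosing a value for DefaultCollectionsDirectory. The following is a minimal sketch, not part of the product: it assumes the collection is stored at the collections directory joined with the collection name, and the function name and example paths are hypothetical. The portal may append further path components, so treat a passing result as a lower bound only.

```python
import posixpath

# Limit stated in the text above for the search collection file path.
MAX_COLLECTION_PATH_LENGTH = 118


def collection_path_fits(collections_dir: str, collection_name: str) -> bool:
    """Return True if the joined path stays within the 118-character limit.

    Assumption: the collection lives at <collections_dir>/<collection_name>.
    posixpath is used because the problem is described for UNIX systems.
    """
    full_path = posixpath.join(collections_dir, collection_name)
    return len(full_path) <= MAX_COLLECTION_PATH_LENGTH


if __name__ == "__main__":
    # Hypothetical directory and collection name, for illustration only.
    print(collection_path_fits("/opt/portal/collections", "PortalCollection"))
```

If the check fails, a shorter directory value is the remedy the text describes.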
In such a scenario...
- Perform the required configuration tasks, for example, for the language or path settings.
- Create a new search collection with the appropriate configuration settings.
- Export the content sources from the default search collection.
In a default portal installation these are the Portal Content Source, which contains portal pages and portlets, and the WCM Content Source, which contains web content.
- Import these exported content sources into the new search collection. For details, refer to the Portal Search documentation or the Manage Search portlet help.
- We can now delete the default search collection.
Portal Search performs a new crawl on the portal site search collection.
- On a multilingual portal site we can create multiple collections in different languages. For details refer to Crawl a multilingual portal site.
- Starting the crawl for the first time might result in a warning message. We can ignore this message.
Crawl a remote portal site
Configure Portal Search to crawl and index a remote, public portal site.
We can enable search on other portal sites. However, only the public pages of other portals can be searched.
To have Portal Search crawl and index a public portal site:
- Create a new content source using the Manage Search portlet.
- Select Web site from the pull-down menu.
- Enter the URL of the portal site to make available for search by the users.
When you start the crawl, the public portion of the portal site is crawled. The search collection will only contain public pages.
Crawl an external site using a seedlist provider
The seedlist crawler is a special HTTP crawler used to crawl external sites which publish their content using the seedlist format. The seedlist format is an ATOM/XML-based format specifically for publishing application content, including all its metadata. The format supports publishing only the content updated between crawling sessions, which makes crawling more effective. Configure the seedlist crawler with general parameters, filters and schedulers, then run the crawler.
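To illustrate the idea of picking up only updated content between crawling sessions, here is a minimal sketch that filters entries of a generic Atom feed by their `updated` timestamp. It is not the crawler's actual implementation: the sample feed, the entry IDs, and the function names are hypothetical, and a real seedlist carries additional metadata that is not modeled here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed using only standard Atom elements.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example seedlist</title>
  <entry>
    <id>doc-1</id>
    <updated>2024-01-10T08:00:00Z</updated>
    <link href="https://example.com/docs/old.html"/>
  </entry>
  <entry>
    <id>doc-2</id>
    <updated>2024-03-05T12:30:00Z</updated>
    <link href="https://example.com/docs/new.html"/>
  </entry>
</feed>"""


def parse_atom_time(text):
    """Parse an Atom timestamp, mapping a trailing 'Z' to a UTC offset."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def entries_updated_since(feed_xml, last_crawl):
    """Return the IDs of entries updated after last_crawl (timezone-aware)."""
    root = ET.fromstring(feed_xml)
    changed = []
    for entry in root.findall("atom:entry", ATOM_NS):
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        if updated and parse_atom_time(updated) > last_crawl:
            changed.append(entry.findtext("atom:id", default="", namespaces=ATOM_NS))
    return changed


if __name__ == "__main__":
    last_crawl = datetime(2024, 2, 1, tzinfo=timezone.utc)
    print(entries_updated_since(SAMPLE_FEED, last_crawl))  # prints ['doc-2']
```

Only the second entry is newer than the previous crawl time, so it is the only one that would need to be fetched again in this sketch.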
Before configuring the seedlist crawler, collect the following information:
- Root URL, which is the URL of the seedlist page.
The seedlist page is a special ATOM/XML page containing metadata that directs the crawler to the actual links that should be fetched and indexed to become searchable later. The seedlist page also contains document level metadata stored along with the document in the search index. To make seedlist crawler results searchable, you must provide the crawler with a URL to a page containing a seedlist. The crawler retrieves the seedlist and crawls the pages indicated by the seedlist.
- User ID and Password, used by the crawler to authenticate when it requests the seedlist page.
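The role of the seedlist page described above, a list of links to fetch together with document-level metadata, can be sketched as follows. This is an illustration under stated assumptions: the sample document uses only standard Atom elements, the URLs and titles are hypothetical, and `read_seedlist` is an invented helper name rather than a Portal Search API. Authentication is left out.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical seedlist content; a real seedlist contains further metadata.
SAMPLE_SEEDLIST = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example seedlist</title>
  <entry>
    <id>doc-1</id>
    <title>Product overview</title>
    <link href="https://example.com/docs/overview.html"/>
  </entry>
  <entry>
    <id>doc-2</id>
    <title>Release notes</title>
    <link href="https://example.com/docs/release-notes.html"/>
  </entry>
</feed>"""


def read_seedlist(feed_xml):
    """Return one dict per entry: the URL to fetch plus its metadata."""
    root = ET.fromstring(feed_xml)
    documents = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        if link is None or not link.get("href"):
            continue  # entry has no link, so there is nothing to fetch
        documents.append({
            "url": link.get("href"),
            "id": entry.findtext(ATOM + "id", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
        })
    return documents


if __name__ == "__main__":
    for doc in read_seedlist(SAMPLE_SEEDLIST):
        print(doc["url"], "-", doc["title"])
```

Each returned dictionary pairs the link to be crawled with the metadata that would be stored alongside the document in the index.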
To configure and create the seedlist crawler:
- Click...
Manage Search | Search Services | Portal Search Service | search_collection | New Content Source | Content source type | Seedlist provider
- Under the tabs General Parameters, Advanced parameters, Schedulers and Security, provide the information in the fields and select options as required. For details refer to the topic Manage and administer Portal Search.
- Click Create.
- To run the crawler, click the start crawler icon (right-pointing arrow) next to the content source name on the Content Sources page. If you have defined a crawler schedule under the Schedulers tab, the crawler will start at the next possible time specified.
Related: Manage and administer Portal Search
Configure a crawler to search the local portal site
Crawl a multilingual portal site
Configure search on a secured portal site
Configure the default location for search collections
Tips for using Portal Search
Related reference:
Apply filter rules