Portal Express Beta Version 6.1

Operating systems: i5/OS, Linux, Windows
The seedlist crawler is a special HTTP crawler that crawls external sites that publish their content in the seedlist format. The seedlist format is an ATOM/XML-based format designed specifically for publishing application content, including all of its metadata. Between crawling sessions, the format supports publishing only the content that has been updated, which makes crawling more efficient. You can configure the seedlist crawler with general parameters, filters, and schedulers, and then run the crawler.
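As an illustration, the following sketch consumes a seedlist as a standard ATOM feed whose entries carry the crawlable links. The sample feed, its element values, and the helper function are assumptions for illustration, not the product's exact schema or implementation:

```python
# Minimal sketch of a seedlist consumer, assuming the seedlist is a
# standard ATOM feed whose <entry> elements carry the crawlable links.
# The sample feed below is illustrative only.
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

SAMPLE_SEEDLIST = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example seedlist</title>
  <entry>
    <title>Document one</title>
    <link href="http://example.com/docs/1"/>
    <updated>2007-01-15T10:00:00Z</updated>
  </entry>
  <entry>
    <title>Document two</title>
    <link href="http://example.com/docs/2"/>
    <updated>2007-01-16T10:00:00Z</updated>
  </entry>
</feed>
"""

def links_from_seedlist(xml_text):
    """Return the href of every entry link in an ATOM seedlist page."""
    root = ET.fromstring(xml_text)
    return [link.get("href")
            for entry in root.findall(ATOM_NS + "entry")
            for link in entry.findall(ATOM_NS + "link")]

print(links_from_seedlist(SAMPLE_SEEDLIST))
# A real crawler would then fetch and index each of these URLs,
# storing the per-entry metadata alongside each document.
```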
The seedlist page is a special ATOM/XML page containing metadata that directs the crawler to the actual links to be fetched and indexed so that they become searchable. The seedlist page also contains document-level metadata that is stored along with each document in the search index. To make seedlist crawler results searchable, provide the crawler with a URL to a page that contains a seedlist; the crawler retrieves the seedlist and crawls the pages that it indicates.

Before configuring the seedlist crawler, collect the following information:
| Parameter name | Parameter value description | Required? |
|---|---|---|
| Content Source Name | A name that helps you identify the seedlist source being crawled. | No |
| Collect documents linked from this URL | The root URL: the URL of the seedlist page. | Yes |
| Levels of links to follow | Use the drop-down menu to select how many levels of pages the crawler follows from the seedlist. | No; the default value is 1. |
| Number of documents to collect | The maximum number of linked documents to collect. | No; the default is unlimited. |
| Force complete crawl | Indicates whether the crawler fetches only updates from the seedlist provider or the full list of content. When checked, the crawler requests the full list of content items; when unchecked, it requests only the list of updates. | No; checked by default. |
| Stop collecting after | The maximum time, in minutes, that the crawler should operate. | No |
| Stop fetching a document after | The maximum time, in seconds, that the crawler spends trying to fetch a document. | No |
| Links expire after | The number of days after which links expire and must be refreshed by the crawler. | No; the default is unlimited. Not available unless Force complete crawl is selected. |
| Remove broken links after | The number of days after which broken links are removed. | No; the default value is 10. Not available unless Force complete crawl is selected. |
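The session limits in the table above can be sketched as a simple budget check inside a fetch loop. The function name and parameters below mirror the table labels but are illustrative assumptions, not the product's implementation:

```python
# Sketch of how the crawler's session limits might be applied, assuming
# a simple fetch loop. "stop_collecting_after_minutes" and
# "number_of_documents_to_collect" mirror the table parameters;
# None means unlimited.
import time

def should_stop(started_at, docs_collected,
                stop_collecting_after_minutes=None,
                number_of_documents_to_collect=None):
    """Return True once either the time budget or the document budget
    for a crawling session is exhausted."""
    if (number_of_documents_to_collect is not None
            and docs_collected >= number_of_documents_to_collect):
        return True
    if stop_collecting_after_minutes is not None:
        elapsed_minutes = (time.monotonic() - started_at) / 60.0
        if elapsed_minutes >= stop_collecting_after_minutes:
            return True
    return False

# Example: a session capped at 100 documents stops after the 100th.
start = time.monotonic()
print(should_stop(start, 100, number_of_documents_to_collect=100))  # True
print(should_stop(start, 5, number_of_documents_to_collect=100))    # False
```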
| Parameter name | Parameter value description | Required? |
|---|---|---|
| User name | User ID that the crawler uses to authenticate to the server that hosts the seedlist page. | Yes |
| Password | Password that the crawler uses to authenticate to the server that hosts the seedlist page. | Yes |
| Host Name | The name of the server referenced by the seedlist. If left blank, the host name is inferred from the seedlist root URL. | No |
| Realm | The realm of the secured content source or repository. | No |
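The security parameters above map onto standard HTTP basic authentication. The sketch below shows that mapping using Python's urllib; the host, realm, and credentials are illustrative placeholders, not values from the product:

```python
# Sketch of how User name, Password, Host Name, and Realm map onto
# HTTP basic authentication. All values here are placeholders.
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Host Name and Realm scope the credentials; User name and Password
# are what the crawler sends when the seedlist page challenges it.
password_mgr.add_password(realm="seedlist-realm",
                          uri="http://seedlist.example.com/",
                          user="crawler-user",
                          passwd="secret")

auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)
# opener.open("http://seedlist.example.com/seedlist.xml") would now
# answer a 401 challenge for "seedlist-realm" with these credentials.
```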