Crawl an external site using a seed list

The seed list crawler is a special HTTP crawler that crawls external sites which publish their content in the seed list format. The seed list format is an ATOM/XML-based format designed specifically for publishing application content together with its metadata. Between crawling sessions, the format supports publishing only the content that has changed, which makes crawling more efficient. You configure the seed list crawler with general parameters, filters, and schedulers, and then run the crawler.
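Because the format is ATOM-based, a standard XML parser can read a seed list. Purely as an illustration, here is a minimal Python sketch that parses a seed list feed; the embedded sample feed is hypothetical, since real seed lists carry application-specific metadata elements on top of plain ATOM.

    # Minimal sketch: reading an ATOM-based seed list with the Python
    # standard library. The sample feed is hypothetical; real seed lists
    # add application-specific metadata elements.
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Example application seed list</title>
      <updated>2011-06-10T09:00:00Z</updated>
      <entry>
        <id>urn:example:doc:42</id>
        <title>A published document</title>
        <updated>2011-06-09T17:30:00Z</updated>
        <link rel="alternate" href="http://example.com/app/doc/42"/>
      </entry>
    </feed>"""

    root = ET.fromstring(SAMPLE_FEED)
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title")
        updated = entry.findtext(ATOM + "updated")
        href = entry.find(ATOM + "link").get("href")
        # An incremental crawl would skip entries whose 'updated' time
        # precedes the previous crawling session.
        print(updated, href, title)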

Before configuring the seed list crawler, collect the following information: the URL of the seed list page and, if the seed list is secured, the user name, password, and realm that the crawler uses to authenticate to it.

To configure and create the seed list crawler:

  1. Click Manage Search -> Search Services.

  2. Click the relevant Portal Search Service.

  3. Click the name of an existing search collection, or create a new search collection.

  4. Click New Content Source.

  5. Click the drop-down menu icon next to Content source type and select Seed list Feed to indicate that the content source is a seed list.

  6. Under the General Parameters tab, provide required and optional information in the following fields:

    Description of the parameters under the General Parameters tab and whether each is required:

    Content Source Name (optional)
      A name that helps you identify the seed list source being crawled.

    Collect documents linked from this URL (required)
      The root URL of the seed list page.

    Levels of links to follow (optional)
      How many levels of pages the crawler follows from the seed list, selected from the drop-down menu. The default value is 1.

    Number of documents to collect (optional)
      The maximum number of linked documents to collect. The default value is unlimited.

    Force complete crawl (optional)
      Determines whether the crawler fetches the full list of content items or only the updates published since the previous crawl. When checked, the crawler requests the full list of content items; when unchecked, it requests only the list of updates. The default value is checked.

    Stop collecting after (optional)
      The maximum time, in minutes, that the crawler runs.

    Stop fetching a document after (optional)
      The maximum time, in seconds, that the crawler spends trying to fetch a single document.

    Links expire after (optional)
      The number of days after which links expire and must be refreshed by the crawler. The default value is unlimited. This option is available only when Force complete crawl is selected.

    Remove broken links after (optional)
      The number of days after which broken links are removed. The default value is 10. This option is available only when Force complete crawl is selected.

  7. Under the Schedulers tab, set how often the crawler should run to update the search content.

    1. Set the date when the crawler should start running.

    2. Set the time of day when the crawler should run.

    3. Set the update interval.

    4. Click Create.

  8. Under the Filters tab, you can define rules that control how the crawler collects documents and adds them to the search index. You can include or exclude documents based on their URLs; details about the filtering settings are provided in Apply filter rules. A sketch of how such URL rules typically behave appears after these steps.

      Note that the filters are not applied to the links inside the seed list itself.

  9. Under the Security tab, provide required and optional information in the following fields (a sketch of how the crawler might use these credentials appears after these steps):

    Description of the parameters under the Security tab and whether each is required:

    User name (required)
      The user ID that the crawler uses to authenticate to the seed list page.

    Password (required)
      The password that the crawler uses to authenticate to the seed list page.

    Host Name (optional)
      The name of the server referenced by the seed list. If left blank, the host name is inferred from the seed list root URL.

    Realm (optional)
      The realm of the secured content source or repository.

  10. Click Create.

  11. To run the crawler, click the start crawler icon (right-pointing arrow) next to the content source name on the Content Sources page.

      If you have defined a crawler schedule under the Schedulers tab, the crawler starts at the next time that the schedule specifies. A sketch of how that time follows from the start date, time of day, and update interval appears after these steps.
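The schedule defined in step 7 determines when the crawler next runs. As a rough illustration of the arithmetic (the start time and interval are placeholders), a Python sketch:

    # Sketch: deriving the next run time from a schedule's start datetime
    # and fixed update interval. The values are placeholders.
    import math
    from datetime import datetime, timedelta

    start = datetime(2011, 6, 10, 2, 0)   # scheduled start date and time of day
    interval = timedelta(hours=12)        # update interval
    now = datetime.now()

    if now <= start:
        next_run = start
    else:
        # Round up to the next whole interval after 'start'.
        next_run = start + math.ceil((now - start) / interval) * interval

    print("Next crawl at", next_run)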
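Step 8's include/exclude rules act on document URLs. The exact rule syntax is covered in Apply filter rules; the sketch below only illustrates the general include-then-exclude behavior, with hypothetical wildcard patterns:

    # Sketch of URL include/exclude filtering with hypothetical wildcard
    # patterns; the actual rule syntax is described in "Apply filter rules".
    from fnmatch import fnmatch

    include_rules = ["http://example.com/app/*"]
    exclude_rules = ["http://example.com/app/private/*"]

    def is_collected(url):
        # A URL is collected when it matches an include rule
        # and no exclude rule.
        if not any(fnmatch(url, p) for p in include_rules):
            return False
        return not any(fnmatch(url, p) for p in exclude_rules)

    print(is_collected("http://example.com/app/doc/42"))         # True
    print(is_collected("http://example.com/app/private/doc/7"))  # False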
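The credentials from step 9 are what the crawler presents when the seed list page is protected. Behind the scenes this is ordinary HTTP authentication; here is a minimal sketch with a placeholder URL and credentials, assuming basic authentication as the scheme:

    # Sketch: fetching a protected seed list over HTTP basic authentication.
    # The URL and credentials are placeholders, and basic auth is an
    # assumption; the server may use a different scheme.
    import urllib.request

    seedlist_url = "http://example.com/app/seedlist"

    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # With a known realm, urllib.request.HTTPPasswordMgr could be used
    # and keyed by that realm instead.
    password_mgr.add_password(None, seedlist_url, "crawler-user", "secret")
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(password_mgr))

    with opener.open(seedlist_url, timeout=30) as response:
        feed_xml = response.read()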


Parent topic: Search and crawling portal and other sites
Related topic: Apply filter rules
