Setting up a document list for the Webserver search engine on HTTP Server

In the IBM HTTP Server for i5/OS, you can create a document list for the Webserver search engine with the IBM Web Administration for i5/OS interface.
Information for this topic supports the latest PTF levels for HTTP Server for iSeries. IBM recommends that you install the latest PTFs to upgrade to the latest level of the HTTP Server for i5/OS. Some of the topics documented here are not available prior to this update. See IBM Service for more information.
A document list is a file that contains a list of documents used to create or update a search index. When a request for a search title or description is sent, it is compared to the document list for possible matches.
To set up a document list for use with the Webserver search engine, complete the steps in the following sections.

Create a document list

To create a document list, do the following:

Click the Advanced tab.

Click the Search Setup subtab.

Expand Search Engine Setup.

Click Build document list.

Choose one of the two options:

Build a document list from documents on this server

Select this option if the documents to be included in the document list are in a local directory.

Build the document list by crawling a URL

Select this option if the documents to be included in the document list reside on a remote server.

There are two additional options if you choose to build the document list using the Web crawler. These are:

Build the document list by crawling a URL

Select this option to crawl a single URL.

Build the document list from selected URL and options objects

Select this option only if you have previously created a URL object and an options object. See Setting up a URL object for the Webserver search engine on HTTP Server and Setting up an options object for the Webserver search engine on HTTP Server.

Click Apply.

Build a document list from documents on this server

If you opted to build a document list from a local directory, follow these instructions to complete your document list:

Choose one of the two document list file name options:

Create a new document list file

Select this option to create a new document list file. Replace the asterisk (*) with a new name for your document list file.

Use the document list in this file

Select this option to use an existing document list file. Select the document list file from the list.

There are two additional options if you choose to use an existing document list file. These are:

Replace the document list file

Select this option to overwrite the existing document list file.

Append the new list to the document list file

Select this option to add any new information to the existing document list file. This option will not delete existing information.

In the Build a document list from this directory field, enter the directory from which the document list will be built. For example, /www/mydocs/public/info.
There are two additional options that you may select. These are:

Traverse subdirectories in this directory

Select to include any documents in subdirectories of the directory you provided in the field above.

Document filter

Select this option if you want the document list to include only specific file types. For example, entering *.htm* builds a document list that contains only files of type htm and html.

Click Apply.
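The effect of the Traverse subdirectories, Document filter, Replace, and Append options can be pictured with a short Python sketch. This is an illustration only, not the code that HTTP Server runs. The function names, and the assumption that a document list file holds one path per line, are hypothetical and are not taken from the product.

```python
import fnmatch
import os


def build_document_list(root, pattern="*", traverse=True):
    """Return the sorted paths under root whose file name matches pattern.

    pattern is a shell-style filter such as "*.htm*" (assumed semantics).
    traverse=False limits the search to root itself.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
        if not traverse:
            # Emptying dirnames stops os.walk from descending any further.
            dirnames.clear()
    return sorted(matches)


def write_document_list(paths, list_file, append=False):
    """Write one path per line (assumed format); replace or append."""
    mode = "a" if append else "w"  # "w" overwrites, "a" keeps existing lines
    with open(list_file, mode, encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
```

With these helpers, build_document_list("/www/mydocs/public/info", "*.htm*") would collect the htm and html files under the example directory, and write_document_list(paths, list_file, append=True) would add them to an existing list without deleting what is already there. Note that a shell-style *.htm* pattern also matches any other extension that begins with htm.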

Build the document list by crawling a URL

If you opted to build a document list with the Web crawler that will crawl a URL, follow these instructions to complete your document list:

Choose one of the two document list file name options:

Create a new document list file

Select this option to create a new document list file. Replace the asterisk (*) with a new name for your document list file.

Use the document list in this file

Select this option to use an existing document list file. Select the document list file from the list.

There are two additional options if you choose to use an existing document list file. These are:

Replace the document list file

Select this option to overwrite the existing document list file.

Append the new list to the document list file

Select this option to add any new information to the existing document list file. This option will not delete existing information.

Enter the Web crawler options:

URL

Enter the URL the Web crawler will visit to add documents to your document list. For example, http://www.ibm.com.

URL domain filter

Enter the URL domain filter the Web crawler will stay on. For example, ibm.com.

Maximum crawling depth

Enter the depth of the crawling from the starting URL. For example, entering a depth of 0 will download only the starting URL page. Entering a depth of 1 will continue the crawl to the first layer of links from the starting URL.

Support robot exclusion

If you select Yes, any site or pages that contain robot exclusion META tags or files will not be downloaded. Excluded files do not usually contain HTML or text. See Managing Web spiders, Web crawlers, and robots on HTTP Server for more information.
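The interaction between Maximum crawling depth and URL domain filter can be illustrated with a small Python sketch that walks an in-memory link graph instead of fetching real pages. It is a sketch under stated assumptions, not the product's crawler: the function names and the sample link graph are hypothetical, and the rule that the filter matches the host itself or any of its subdomains is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def in_domain(url, domain_filter):
    """Assumed rule: the host equals the filter or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == domain_filter or host.endswith("." + domain_filter)


def crawl(start_url, links, max_depth, domain_filter):
    """Breadth-first walk of links; returns URLs in the order visited.

    links maps each URL to the URLs it points to (hypothetical data).
    Depth 0 is the starting page only; depth 1 adds its direct links.
    """
    seen = {start_url}
    visited = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen and in_domain(target, domain_filter):
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical link graph used only for illustration.
LINKS = {
    "http://www.ibm.com": ["http://www.ibm.com/a", "http://other.example/x"],
    "http://www.ibm.com/a": ["http://www.ibm.com/b"],
}
```

In this sketch, crawl("http://www.ibm.com", LINKS, 0, "ibm.com") visits only the starting page, a depth of 1 adds http://www.ibm.com/a, and the link to other.example is never followed because it falls outside the domain filter.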

Choose crawling options:

Directory to store documents

Enter the directory to store the documents the Web crawler finds. For example, /www/mydocs/public/crawl.

Document language

Select the language of the documents being retrieved by the Web crawler.

Proxy server for HTTP

Enter the proxy server for HTTP requests. Possible values include any valid server name.

Proxy port for HTTP

Enter the port number for the above proxy server. A port is required if a proxy server for HTTP is specified.

Proxy server for HTTPS

Enter the proxy server for HTTPS requests.

Proxy port for HTTPS

Enter the port number for the above proxy server.

Maximum file size to download

Enter the maximum size for a downloaded file (in KB).

Maximum storage for files

Enter the maximum storage space for all downloaded files (in MB).

Maximum threads

Enter the maximum number of threads used during web crawling. Set this value based on the system resources that are available.

Maximum run time

Enter the maximum amount of time, in hours and minutes, that the crawling session remains active.

Activity log file

Enter the action to take for an activity log file. This file contains information about the crawling session plus any errors that occur. This file must be in a directory of the IFS. You can choose to run a crawling session with or without an activity log file. You also have the option of replacing the log file each time a crawling session is started or appending information to the existing file.

There are two additional options if you choose to write an activity log. These are:

Create or replace the logging file

Select this option if the log file does not exist or you want to overwrite an existing log file.

Append to the existing logging file

Select this option to add any new information to the existing log file. This option will not delete existing information.

Click Apply.
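Because Maximum file size to download is entered in KB and Maximum storage for files is entered in MB, it can help to see the two limits applied with consistent units. The Python sketch below is illustrative only. It assumes 1 KB is 1024 bytes, that a file larger than the per-file limit is skipped, and that downloading stops once the next file would exceed the total storage limit. This topic does not state how the Web crawler actually handles either case, and the function name is hypothetical.

```python
def plan_downloads(files, max_file_kb, max_storage_mb):
    """Return (kept_names, bytes_used) for a list of (name, size_in_bytes).

    Assumptions (not from the product): 1 KB = 1024 bytes, oversized
    files are skipped, and the session stops when storage would overflow.
    """
    file_limit = max_file_kb * 1024              # KB to bytes
    storage_limit = max_storage_mb * 1024 * 1024  # MB to bytes
    kept = []
    used = 0
    for name, size in files:
        if size > file_limit:
            continue  # single file is too large; skip it
        if used + size > storage_limit:
            break  # total storage budget would be exceeded; stop
        kept.append(name)
        used += size
    return kept, used
```

For example, with a 512 KB file limit and 1 MB of storage, a 600 KB file is skipped, and two 500 KB files cannot both be stored after a 100 KB file because together they would exceed 1 MB.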

Build the document list from selected URL and options objects

If you opted to build a document list with the Web crawler using selected URL and options objects, follow these instructions to complete your document list:

Choose one of the two document list file name options:

Create a new document list file

Select this option to create a new document list file. Replace the asterisk (*) with a new name for your document list file.

Use the document list in this file

Select this option to use an existing document list file. Select the document list file from the list.

There are two additional options if you choose to use an existing document list file. These are:

Replace the document list file

Select this option to overwrite the existing document list file.

Append the new list to the document list file

Select this option to add any new information to the existing document list file. This option will not delete existing information.

Select the URL object. See Setting up a URL object for the Webserver search engine on HTTP Server.

Select the options object. See Setting up an options object for the Webserver search engine on HTTP Server.

Select a validation list option (see Setting up validation lists for the Webserver search engine on HTTP Server):

Validation list

Select Do not use a validation list if you know the server the Web crawler will visit does not use a validation list for authentication. Otherwise, select Use this validation list for sites requiring a userid and password and select the validation list to be used from the list.

Click Apply.