Start HTTP Crawler (STRHTTPCRL)

Where allowed to run: All environments (*ALL)
Threadsafe: No
Parameters
Examples
Error messages

The Start HTTP Crawler (STRHTTPCRL) command allows you to create or append to a document list by crawling remote web sites, downloading the files found, and saving their path names in the specified document list.

To create a document list, specify *CRTDOCL for the Option (OPTION) parameter.

To update a document list, specify *UPDDOCL for the OPTION parameter.

Top


 

Parameters

Keyword Description Choices Notes
OPTION Option *CRTDOCL, *UPDDOCL Required, Positional 1
METHOD Crawling method *OBJECTS, *DETAIL Optional
OBJECTS URL and options objects Element list Optional
Element 1: URL object Character value
Element 2: Options object Character value
DOCLIST Document list file Path name Optional
DOCDIR Document storage directory Path name, '/QIBM/USERDATA/HTTPSVR/INDEX/DOC' Optional
LANG Language of documents *ARABIC, *BALTIC, *CENTEUROPE, *CYRILLIC, *ESTONIAN, *GREEK, *HEBREW, *JAPANESE, *KOREAN, *SIMPCHINESE, *TRADCHINESE, *THAI, *TURKISH, *WESTERN Optional
URL URL Character value Optional
URLFTR URL filter Character value, *NONE Optional
MAXDEPTH Maximum crawling depth 0-100, 3, *NOMAX Optional
ENBROBOT Enable robots *YES, *NO Optional
PRXSVR Proxy server for HTTP Character value, *NONE Optional
PRXPORT Proxy port for HTTP 1-65535 Optional
PRXSVRSSL Proxy server for HTTPS Character value, *NONE Optional
PRXPORTSSL Proxy port for HTTPS 1-65535 Optional
MAXSIZE Maximum file size 1-6000, 1000, *NOMAX Optional
MAXSTGSIZE Maximum storage size 1-65535, 100, *NOMAX Optional
MAXTHD Maximum threads 1-50, 20 Optional
MAXRUNTIME Maximum run time Single values: *NOMAX, Other values: Element list Optional
Element 1: Hours 0-1000, 2
Element 2: Minutes 0-59, 0
LOGFILE Logging file Path name, *NONE Optional
CLRLOG Clear logging file *YES, *NO Optional
VLDL Validation list Name, *NONE Optional

Top

 

Option (OPTION)

Specifies the document list task to perform.

This is a required parameter.

*CRTDOCL

Create a document list. If the file already exists, it will be replaced.

*UPDDOCL

Append additional document paths to a document list.
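
For example, a command like the following (the document list path and URL are illustrative) appends newly found documents to an existing document list:

  STRHTTPCRL OPTION(*UPDDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')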

Top

 

Crawling method (METHOD)

Specifies the crawling method to use.

*DETAIL

Use specific values, such as the document storage directory, a URL, and a URL filter, for crawling remote web sites. These are the same values that are contained in a URL object and an options object.

*OBJECTS

Use a URL object and an options object for crawling web sites. These objects contain specific values used in the crawling process.

Top

 

URL and options objects (OBJECTS)

Specifies the URL object and options object to use for crawling. Both objects must be specified. Use the Configure HTTP Search (CFGHTTPSCH) command to create these objects.

Element 1: URL object

character-value

Specify the name of the URL object to use.

Element 2: Options object

character-value

Specify the name of the options object to use.
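
For example, assuming a URL object named 'MYURL' and an options object named 'MYOPT' were previously created with the Configure HTTP Search (CFGHTTPSCH) command (the object names are illustrative), a crawl driven by those objects might be started like this:

  STRHTTPCRL OPTION(*CRTDOCL) METHOD(*OBJECTS)
    OBJECTS('MYURL' 'MYOPT')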

Top

 

Document list file (DOCLIST)

Specifies the document list file to hold the path names of the documents found by crawling remote web sites.

path-name

Specify the document list file path name.

Top

 

Document storage directory (DOCDIR)

Specifies the directory to use to store the documents that are downloaded.

'/QIBM/USERDATA/HTTPSVR/INDEX/DOC'

This directory is used to store the downloaded documents.

path-name

Specify the document storage directory path name.

Top

 

Language of documents (LANG)

Specifies the language of the documents that are to be downloaded. These language choices are similar to the character sets or encodings that can be selected on a browser.

*WESTERN

The documents are in a Western language such as English, Finnish, French, Spanish, or German.

*ARABIC

The documents are in Arabic.

*BALTIC

The documents are in a Baltic language such as Latvian or Lithuanian.

*CENTEUROPE

The documents are in a Central European language such as Czech, Hungarian, Polish, Slovakian, or Slovenian.

*CYRILLIC

The documents are in a Cyrillic language such as Russian, Ukrainian, or Macedonian.

*ESTONIAN

The documents are in Estonian.

*GREEK

The documents are in Greek.

*HEBREW

The documents are in Hebrew.

*JAPANESE

The documents are in Japanese.

*KOREAN

The documents are in Korean.

*SIMPCHINESE

The documents are in Simplified Chinese.

*TRADCHINESE

The documents are in Traditional Chinese.

*THAI

The documents are in Thai.

*TURKISH

The documents are in Turkish.

Top

 

URL (URL)

Specifies the URL (Uniform Resource Locator) to crawl.

character-value

Specify the URL to crawl.

Top

 

URL filter (URLFTR)

Specifies the domain filter used to limit crawling to sites within the specified domain.

*NONE

No filtering will be done based on domain.

character-value

Specify the domain filter to limit crawling.
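
For example, the following command (host name, filter value, and paths are illustrative) restricts crawling to pages within a single domain and stores downloaded files in a specific directory:

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') URLFTR('example.com')
    DOCDIR('/mydir/docs')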

Top

 

Maximum crawling depth (MAXDEPTH)

Specifies the maximum depth to crawl from the starting URL. A value of zero stops crawling at the starting URL site. Each additional layer follows the referenced links found within the current URL.

3

Referenced links will be crawled three layers deep.

*NOMAX

Referenced links will be crawled regardless of depth.

0-100

Specify the maximum crawling depth.

Top

 

Enable robots (ENBROBOT)

Specifies whether to enable support for robot exclusion. If robot exclusion support is enabled, any sites or pages that contain robot exclusion META tags or files will not be downloaded.

*YES

Enable support for robot exclusion.

*NO

Do not enable support for robot exclusion.

Top

 

Proxy server for HTTP (PRXSVR)

Specifies the HTTP proxy server to be used.

*NONE

Do not use an HTTP proxy server.

HTTP-proxy-server

Specify the name of the HTTP proxy server.

Top

 

Proxy port for HTTP (PRXPORT)

Specifies the HTTP proxy server port.

1-65535

Specify the number of the HTTP proxy server port. This parameter is required if a proxy server name is specified for the Proxy server for HTTP (PRXSVR) parameter.
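
For example, to crawl through an HTTP proxy (the proxy host name, port, and other values are illustrative), specify both PRXSVR and PRXPORT:

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    PRXSVR('proxy.example.com') PRXPORT(8080)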

Top

 

Proxy server for HTTPS (PRXSVRSSL)

Specifies the HTTPS proxy server to be used for SSL support.

*NONE

Do not use an HTTPS proxy server.

character-value

Specify the name of the HTTPS proxy server for SSL support.

Top

 

Proxy port for HTTPS (PRXPORTSSL)

Specifies the HTTPS proxy server port for SSL support.

1-65535

Specify the number of the HTTPS proxy server port for SSL support. This parameter is required if a proxy server name is specified for the Proxy server for HTTPS (PRXSVRSSL) parameter.

Top

 

Maximum file size (MAXSIZE)

Specifies the maximum file size, in kilobytes, to download.

1000

Download files that are no greater than 1000 kilobytes.

*NOMAX

Files will be downloaded regardless of size.

1-6000

Specify the maximum file size to download, in kilobytes.

Top

 

Maximum storage size (MAXSTGSIZE)

Specifies the maximum storage size, in megabytes, to allocate for downloaded files. Crawling will end when this limit is reached.

100

Up to 100 megabytes of storage will be used for downloaded files.

*NOMAX

No maximum storage size for downloaded files.

1-65535

Specify the maximum storage size, in megabytes, for downloaded files.

Top

 

Maximum threads (MAXTHD)

Specifies the maximum number of threads to start for crawling web sites. Set this value based on the system resources that are available.

20

Start up to 20 threads for web crawling.

1-50

Specify the maximum number of threads to start.
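
For example, the following command (paths and URL are illustrative) limits individual files to 500 kilobytes, total storage to 50 megabytes, and crawling to 10 threads on a constrained system:

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    MAXSIZE(500) MAXSTGSIZE(50) MAXTHD(10)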

Top

 

Maximum run time (MAXRUNTIME)

Specifies the maximum time for crawling to run, in hours and minutes.

Single values

*NOMAX

Run the crawling session until it completes normally or is ended by using the End HTTP Crawler (ENDHTTPCRL) command.

Element 1: Hours

2

Run the crawling session for 2 hours plus the number of minutes specified.

0-1000

Specify the number of hours to run the crawling session.

Element 2: Minutes

0

Run the crawling session for the number of hours specified.

0-59

Specify the number of minutes to run the crawling session. The crawling session will run for the number of hours specified in the first element of this parameter plus the number of minutes specified.
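
For example, to limit a crawling session (the paths and URL are illustrative) to 1 hour and 30 minutes, specify the hours and minutes as the two elements of MAXRUNTIME:

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com') MAXRUNTIME(1 30)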

Top

 

Logging file (LOGFILE)

Specifies the activity log file to be used. This file contains information about the crawling session, including any errors that occur during the session. The file must reside in a directory.

*NONE

Do not use an activity log file.

path-name

Specify the path name of the logging file.

Top

 

Clear logging file (CLRLOG)

Specifies whether to clear the activity log file before starting the crawling session.

*YES

Always clear the activity log file before each crawling session.

*NO

Do not clear the activity log file.
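
For example, the following command (log file path, document list, and URL are illustrative) writes crawling activity to a log file and clears that file before the session begins:

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.example.com')
    LOGFILE('/mydir/crawl.log') CLRLOG(*YES)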

Top

 

Validation list (VLDL)

Specifies the validation list to use for SSL sessions. Use the Configure HTTP Search (CFGHTTPSCH) command to create a validation list object.

*NONE

Do not use a validation list object.

name

Specify the name of the validation list.
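
For example, when crawling HTTPS sites through an SSL proxy, a validation list created with the Configure HTTP Search (CFGHTTPSCH) command can be specified; the validation list name, proxy host, port, and URL below are illustrative:

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('https://www.example.com')
    PRXSVRSSL('proxy.example.com') PRXPORTSSL(8080)
    VLDL(MYVLDL)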

Top


 

Examples

  STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist')
    URL('http://www.ibm.com') MAXDEPTH(2)

This command starts a new crawling session, following referenced links up to 2 layers from the starting URL at www.ibm.com. The document list is created as '/mydir/my.doclist' and contains pairs consisting of a local directory path, for example '/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/us/index.html', and the actual URL of the page, 'http://www.ibm.com/us/'. Use the Configure HTTP Search (CFGHTTPSCH) command to create an index from this document list.

Top


 

Error messages

*ESCAPE Messages

HTP160C

Request to create or append to a document list failed. Reason &1.

HTP166E

Request to print the status of a document list failed. Reason &1.

Top