Start HTTP Crawler (STRHTTPCRL)
Where allowed to run: All environments (*ALL)
Threadsafe: No
The Start HTTP Crawler (STRHTTPCRL) command allows you to create or append to a document list by crawling remote web sites, downloading the files that are found, and saving their path names in the specified document list.
To create a document list, specify *CRTDOCL for the Option (OPTION) parameter.
To update a document list, specify *UPDDOCL for the OPTION parameter.
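For example, the two forms might look like the following sketch; the document list path and URLs shown here are purely illustrative.

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com')
STRHTTPCRL OPTION(*UPDDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com/news/')

The first command creates the document list, replacing any existing file of that name; the second appends the path names found in a later crawl to the same list.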
Parameters
| Keyword | Description | Choices | Notes |
|---|---|---|---|
| OPTION | Option | *CRTDOCL, *UPDDOCL | Required, Positional 1 |
| METHOD | Crawling method | *OBJECTS, *DETAIL | Optional |
| OBJECTS | URL and options objects | Element list: Element 1: URL object (character value); Element 2: Options object (character value) | Optional |
| DOCLIST | Document list file | Path name | Optional |
| DOCDIR | Document storage directory | Path name, '/QIBM/USERDATA/HTTPSVR/INDEX/DOC' | Optional |
| LANG | Language of documents | *ARABIC, *BALTIC, *CENTEUROPE, *CYRILLIC, *ESTONIAN, *GREEK, *HEBREW, *JAPANESE, *KOREAN, *SIMPCHINESE, *TRADCHINESE, *THAI, *TURKISH, *WESTERN | Optional |
| URL | URL | Character value | Optional |
| URLFTR | URL filter | Character value, *NONE | Optional |
| MAXDEPTH | Maximum crawling depth | 0-100, 3, *NOMAX | Optional |
| ENBROBOT | Enable robots | *YES, *NO | Optional |
| PRXSVR | Proxy server for HTTP | Character value, *NONE | Optional |
| PRXPORT | Proxy port for HTTP | 1-65535 | Optional |
| PRXSVRSSL | Proxy server for HTTPS | Character value, *NONE | Optional |
| PRXPORTSSL | Proxy port for HTTPS | 1-65535 | Optional |
| MAXSIZE | Maximum file size | 1-6000, 1000 | Optional |
| MAXSTGSIZE | Maximum storage size | 1-65535, 100, *NOMAX | Optional |
| MAXTHD | Maximum threads | 1-50, 20 | Optional |
| MAXRUNTIME | Maximum run time | Single values: *NOMAX; Other values: Element list (Element 1: Hours, 0-1000, 2; Element 2: Minutes, 0-59, 0) | Optional |
| LOGFILE | Logging file | Path name, *NONE | Optional |
| CLRLOG | Clear logging file | *YES, *NO | Optional |
| VLDL | Validation list | Name, *NONE | Optional |
Option (OPTION)
Specifies the document list task to perform.
This is a required parameter.
- *CRTDOCL
- Create a document list. If the file already exists, it will be replaced.
- *UPDDOCL
- Append additional document paths to a document list.
Crawling method (METHOD)
Specifies the crawling method to use.
- *DETAIL
- Use specific values for crawling remote web sites such as the document storage directory, a URL, and a URL filter. These are the same values that are contained in a URL object and an options object.
- *OBJECTS
- Use a URL object and an options object for crawling web sites. These objects contain specific values used in the crawling process.
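For example (the object, path, and URL names below are illustrative), the two methods might be specified as follows:

STRHTTPCRL OPTION(*CRTDOCL) METHOD(*OBJECTS) OBJECTS(MYURLS MYOPTS) DOCLIST('/mydir/my.doclist')
STRHTTPCRL OPTION(*CRTDOCL) METHOD(*DETAIL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com') URLFTR('http://www.example.com') MAXDEPTH(2)

With *OBJECTS, the crawling values come from the URL object MYURLS and the options object MYOPTS, created earlier with the Configure HTTP Search (CFGHTTPSCH) command; with *DETAIL, the equivalent values are supplied directly on this command.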
URL and options objects (OBJECTS)
Specifies the objects to use for crawling. Both must be specified. Use the Configure HTTP Search (CFGHTTPSCH) command to create the objects.
Element 1: URL object
- character-value
- Specify the name of the URL object to use.
Element 2: Options object
- character-value
- Specify the name of the options object to use.
Document list file (DOCLIST)
Specifies the document list file to hold the path names of the documents found by crawling remote web sites.
- path-name
- Specify the document list file path name.
Document storage directory (DOCDIR)
Specifies the directory to use to store the documents that are downloaded.
- '/QIBM/USERDATA/HTTPSVR/INDEX/DOC'
- This directory is used to store the downloaded documents.
- path-name
- Specify the document storage directory path name.
Language of documents (LANG)
Specifies the language of the documents that are to be downloaded. These language choices are similar to the character sets or encodings that can be selected on a browser.
- *WESTERN
- The documents are in a Western language such as English, Finnish, French, Spanish, or German.
- *ARABIC
- The documents are in Arabic.
- *BALTIC
- The documents are in a Baltic language such as Latvian or Lithuanian.
- *CENTEUROPE
- The documents are in a Central European language such as Czech, Hungarian, Polish, Slovakian, or Slovenian.
- *CYRILLIC
- The documents are in a Cyrillic language such as Russian, Ukrainian, or Macedonian.
- *ESTONIAN
- The documents are in Estonian.
- *GREEK
- The documents are in Greek.
- *HEBREW
- The documents are in Hebrew.
- *JAPANESE
- The documents are in Japanese.
- *KOREAN
- The documents are in Korean.
- *SIMPCHINESE
- The documents are in Simplified Chinese.
- *TRADCHINESE
- The documents are in Traditional Chinese.
- *THAI
- The documents are in Thai.
- *TURKISH
- The documents are in Turkish.
URL (URL)
Specifies the name of the URL (Uniform Resource Locator) to crawl.
- character-value
- Specify the URL to crawl.
URL filter (URLFTR)
Specifies a domain filter that limits crawling to sites within the specified domain.
- *NONE
- No filtering will be done based on domain.
- character-value
- Specify the domain filter to limit crawling.
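For example, to keep the crawl within a single domain, a filter can be given along with the starting URL; both values below are illustrative, and the exact filter string should match how your site is addressed:

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com') URLFTR('http://www.example.com')

Referenced links outside the filtered domain are not followed.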
Maximum crawling depth (MAXDEPTH)
Specifies the maximum depth to crawl from the starting URL. A value of zero means crawling stops at the starting URL site. Each additional layer refers to following referenced links within the current URL.
- 3
- Referenced links will be crawled three layers deep.
- *NOMAX
- Referenced links will be crawled regardless of depth.
- 0-100
- Specify the maximum crawling depth.
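For example, the sketch below limits the crawl to the starting URL site by specifying a depth of zero (the path and URL are illustrative):

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com') MAXDEPTH(0)

With the default of MAXDEPTH(3), referenced links would also be followed three layers deep.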
Enable robots (ENBROBOT)
Specifies whether to enable support for robot exclusion. If robot exclusion support is enabled, any sites or pages that contain robot exclusion META tags or files will not be downloaded.
- *YES
- Enable support for robot exclusion.
- *NO
- Do not enable support for robot exclusion.
Proxy server for HTTP (PRXSVR)
Specifies the HTTP proxy server to be used.
- *NONE
- Do not use an HTTP proxy server.
- HTTP-proxy-server
- Specify the name of the HTTP proxy server.
Proxy port for HTTP (PRXPORT)
Specifies the HTTP proxy server port.
- 1-65535
- Specify the number of the HTTP proxy server port. This parameter is required if a proxy server name is specified for the Proxy server for HTTP (PRXSVR) parameter.
Proxy server for HTTPS (PRXSVRSSL)
Specifies the HTTPS proxy server to use for SSL support.
- *NONE
- Do not use an HTTPS proxy server.
- character-value
- Specify the name of the HTTPS proxy server for SSL support.
Proxy port for HTTPS (PRXPORTSSL)
Specifies the HTTPS proxy server port for SSL support.
- 1-65535
- Specify the number of the HTTPS proxy server port for SSL support. This parameter is required if a proxy server name is specified for the Proxy server for HTTPS (PRXSVRSSL) parameter.
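For example, a session that must reach the web through a firewall might name both proxies; the server name and ports below are illustrative:

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com')
           PRXSVR('proxy.mycompany.com') PRXPORT(8080)
           PRXSVRSSL('proxy.mycompany.com') PRXPORTSSL(8080)

Each port parameter is required whenever the corresponding proxy server is specified.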
Maximum file size (MAXSIZE)
Specifies the maximum file size, in kilobytes, to download.
- 1000
- Download files that are no greater than 1000 kilobytes.
- *NOMAX
- Files will be downloaded regardless of size.
- 1-6000
- Specify the maximum file size to download, in kilobytes.
Maximum storage size (MAXSTGSIZE)
Specifies the maximum storage size, in megabytes, to allocate for downloaded files. Crawling will end when this limit is reached.
- 100
- Up to 100 megabytes of storage will be used for downloaded files.
- *NOMAX
- No maximum storage size for downloaded files.
- 1-65535
- Specify the maximum storage size, in megabytes, for downloaded files.
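For example, to skip files larger than 500 kilobytes and stop crawling once about 50 megabytes have been stored (both limits are illustrative):

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com')
           MAXSIZE(500) MAXSTGSIZE(50)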
Maximum threads (MAXTHD)
Specifies the maximum number of threads to start for crawling web sites. Set this value based on the system resources that are available.
- 20
- Start up to 20 threads for web crawling.
- 1-50
- Specify the maximum number of threads to start.
Maximum run time (MAXRUNTIME)
Specifies the maximum time for crawling to run, in hours and minutes.
Single values
- *NOMAX
- Run the crawling session until it completes normally or is ended by using the End HTTP Crawler (ENDHTTPCRL) command.
Element 1: Hours
- 2
- Run the crawling session for 2 hours plus the number of minutes specified.
- 0-1000
- Specify the number of hours to run the crawling session.
Element 2: Minutes
- 0
- Run the crawling session for the number of hours specified.
- 0-59
- Specify the number of minutes to run the crawling session. The crawling session will run for the number of hours specified in the first element of this parameter plus the number of minutes specified.
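For example, the first sketch below ends the session after 1 hour and 30 minutes, while the second lets it run until it completes or is ended with the End HTTP Crawler (ENDHTTPCRL) command; the path and URL are illustrative:

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com') MAXRUNTIME(1 30)
STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com') MAXRUNTIME(*NOMAX)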
Logging file (LOGFILE)
Specifies the activity logging file to be used. This file contains information about the crawling session plus any errors that occur during the crawling session. This file must be in a directory.
- *NONE
- Do not use an activity log file.
- path-name
- Specify the path name of the logging file.
Clear logging file (CLRLOG)
Specifies whether to clear the activity log file before starting the crawling session.
- *YES
- Always clear the activity log file before each crawling session.
- *NO
- Do not clear the activity log file.
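For example, to record the session in an activity log that is cleared before each run (the log path is illustrative):

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.example.com')
           LOGFILE('/mydir/crawl.log') CLRLOG(*YES)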
Validation list (VLDL)
Specifies the validation list to use for SSL sessions. Use the Configure HTTP Search (CFGHTTPSCH) command to create a validation list object.
- *NONE
- Do not use a validation list object.
- name
- Specify the name of the validation list.
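For example, assuming a validation list named MYVLDL has already been created with the Configure HTTP Search (CFGHTTPSCH) command, an SSL crawl might be started as follows (the list name and URL are illustrative):

STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('https://www.example.com') VLDL(MYVLDL)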
Examples
STRHTTPCRL OPTION(*CRTDOCL) DOCLIST('/mydir/my.doclist') URL('http://www.ibm.com') MAXDEPTH(2)

This command starts a new crawling session that follows referenced links two layers deep from the starting URL www.ibm.com. The document list will be created as '/mydir/my.doclist' and will contain pairs consisting of a local directory path, for example '/QIBM/USERDATA/HTTPSVR/INDEX/DOC/www.ibm.com/us/index.html', and the actual URL of the page, 'http://www.ibm.com/us/'. Use the Configure HTTP Search (CFGHTTPSCH) command to create an index using this document list.
Error messages
*ESCAPE Messages
- HTP160C
- Request to create or append to a document list failed. Reason &1.
- HTP166E
- Request to print the status of a document list failed. Reason &1.