Web crawling on HTTP Server

 

This topic provides information about Web crawling and Web crawlers for the HTTP Server for i5/OS.

The information in this topic applies to the latest PTF levels for HTTP Server for i5/OS. IBM recommends that you install the latest PTFs to bring the HTTP Server for i5/OS to the latest level. Some of the topics documented here are not available prior to this update. See IBM Service for more information.

A Web crawler is a program that locates and retrieves documents by URL from another Web server. A "crawl" is the process in which the Web crawler follows links within Web pages and downloads the HTML and text pages it finds. The Web crawler downloads the files to a local directory and creates a document list. The document list and the files can then be used to create a search index. The search results link to the actual URLs that were found during the crawl.

Attention: The Web crawler downloads text and HTML files to your iSeries™. The iSeries checks whether sufficient memory is available for a successful Web crawl, but it does not check for available storage.
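The crawl described above can be sketched in a few lines of Python. This is only an illustration of the general technique (follow links, download pages, record a document list), not the implementation used by the HTTP Server. The names `LinkParser`, `crawl`, `fetch`, `max_pages`, and the `example.com` URLs are all assumptions made for the example; the pages are held in memory so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl starting at start_url.

    fetch(url) is a caller-supplied function (hypothetical) that returns
    the page text, or None if the page cannot be downloaded.
    Returns (documents, document_list): the downloaded text keyed by URL,
    and the URLs in the order they were downloaded.
    """
    documents = {}
    document_list = []
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(document_list) < max_pages:
        url = queue.popleft()
        text = fetch(url)
        if text is None:
            continue
        documents[url] = text
        document_list.append(url)
        parser = LinkParser()
        parser.feed(text)
        for href in parser.links:
            # Resolve relative links against the current page, drop #fragments.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return documents, document_list


# An in-memory "site" standing in for a remote Web server.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "plain text, no links",
}
documents, document_list = crawl("http://example.com/", SITE.get)
print(document_list)
```

Each entry in `document_list` is the original URL of a downloaded page, which is the information a search index needs in order to link its results back to the crawled site.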

To crawl a Web site, specify attributes such as the document storage directory and the URL to crawl. Alternatively, you can start a crawl using a URL object and an options object that you have already created using other forms. A URL object contains a list of URLs. An options object contains crawling attributes, such as the proxy server to use for each crawling session.
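As a rough mental model, the two objects can be pictured as simple records: one holding a list of URLs, the other holding attributes shared by every crawling session. The Python sketch below is hypothetical; the class names, field names, default values, and the `plan_sessions` helper are assumptions for illustration and do not describe how the HTTP Server stores these objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UrlObject:
    """Hypothetical stand-in for a URL object: a named list of URLs."""
    name: str
    urls: List[str] = field(default_factory=list)


@dataclass
class OptionsObject:
    """Hypothetical stand-in for an options object: crawling attributes."""
    name: str
    proxy_server: Optional[str] = None  # proxy used for each crawling session
    max_minutes: int = 30               # assumed default for the session time limit


def plan_sessions(url_object, options):
    """Pair every URL in the URL object with the same options object."""
    return [(url, options) for url in url_object.urls]


urls = UrlObject("docs", ["http://example.com/", "http://example.com/help/"])
opts = OptionsObject("default", proxy_server="proxy.example.com:8080")
sessions = plan_sessions(urls, opts)
print(len(sessions))
```

Keeping the URLs and the attributes in separate objects means the same set of attributes can be reused with different URL lists, and the reverse.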

Some sites cannot be entered without some form of authentication, such as a userid and password, or certificate authentication. The Web crawler can handle either case, provided that you complete the required setup.

For a site that requires a userid and password, create a validation list object and enter the URL, userid, and password. See Setting up validation lists for the Webserver search engine on HTTP Server for more information. Then be sure to specify the validation list object when you start crawling.

For a site that requires certificate authentication, see the digital server certificate information on how to obtain a certificate. You can use the digital certificate manager to obtain a new certificate, or register an existing one, for any secure server instance of the IBM® HTTP Server.
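For the userid and password case, a validation list can be thought of as a table of (URL, userid, password) entries that the crawler consults before it requests a protected page. The Python sketch below illustrates one plausible lookup rule, choosing the entry whose URL is the longest prefix of the requested URL. The prefix rule, the names `ValidationEntry` and `find_credentials`, and the sample values are assumptions; the source does not state how the HTTP Server matches entries.

```python
from typing import NamedTuple


class ValidationEntry(NamedTuple):
    """Hypothetical record for one validation list entry."""
    url: str
    userid: str
    password: str


def find_credentials(entries, url):
    """Return the entry whose URL is the longest prefix of url, or None."""
    matches = [entry for entry in entries if url.startswith(entry.url)]
    return max(matches, key=lambda entry: len(entry.url), default=None)


entries = [
    ValidationEntry("http://example.com/", "guest", "guest-pw"),
    ValidationEntry("http://example.com/private/", "admin", "admin-pw"),
]
match = find_credentials(entries, "http://example.com/private/report.html")
print(match.userid)
```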

Building a document list by crawling Web sites always runs as a background task. It takes at least several minutes to run; the actual duration depends on the maximum time you selected for the session and on the other attributes you specified.

See Build the document list by crawling a URL for information on how to use the Web crawler with the Web Administration for i5/OS interface.

 
