Importing existing Web resources using HTTP or FTP

You can import existing Web resources into a project using wizards that invoke HTTP or FTP. These import wizards automate the transfer of complete Web sites into Web projects.

The import wizards also work with Web servers that are protected by firewalls. Both HTTP and FTP import support proxies, and FTP import also supports SOCKS. Rational® Developer uses passive mode for FTP, which reduces security risk and gives you a safer transfer in daily operations.

To use the HTTP or FTP import wizards, designate an existing project into which to import the files. After the import, you can view all the files from the imported Web site within the selected project folder.

The HTTP import uses the HTTP protocol to crawl through the Web site, starting from an initial URL that you provide. The import action retrieves any HTML content available at that URL and parses it for HTTP links. The process repeats for each linked Web page that it encounters within the Web site, until all reachable content and links have been parsed. HTTP import cannot parse pages that contain servlets or programs that are executed when a form is posted, or that are embedded in JavaServer Pages (JSPs).
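
The crawl described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the wizard's actual implementation: the `LinkCollector` class, the `crawl` function, the `fetch` callable, and the in-memory `SITE` dictionary are all hypothetical names, and the dictionary stands in for a real Web server so that the sketch runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url, staying on the starting host.

    fetch(url) is a hypothetical callable that returns HTML text, or
    None when the URL cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            # Resolve relative links against the current page, drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, link))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Stand-in for a real site (assumption, for illustration only).
SITE = {
    "http://host/index.html":
        '<a href="about.html">About</a> <a href="http://other/x.html">X</a>',
    "http://host/about.html":
        '<a href="index.html">Home</a> <a href="news/today.html">News</a>',
    "http://host/news/today.html": "<p>No links here.</p>",
}

pages = crawl("http://host/index.html", SITE.get)
print(sorted(pages))
```

Because the sketch only follows `href` and `src` attributes, content that is reachable only by posting a form is never visited, which mirrors the limitation noted above.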

The files transferred to your project represent a logical snapshot of the Web site at the given URL. Your Web project is populated with files built from the HTML responses of the serving site, so the physical resources on the serving site are not necessarily copied to your project. For example, an HTTP request for a JSP page returns a rendered HTML response, not the JSP page itself. Use HTTP import for static pages and for sites that do not have FTP access.

To import existing Web resources into the Web project using HTTP, complete the following steps:

  1. Create a new project into which to import Web resources, using the New Web Project wizard.

  2. If you intend to use an existing project, select the project in the Enterprise Explorer view.

  3. Select File | Import.

  4. In the Import dialog, select HTTP and click Next.

  5. In the Specify the destination folder and the resources to import page, type the requisite project information.

    • Folder - The imported files are placed in the default location (the Web content folder). You can click the Browse button to change the location of the imported files in your project.

    • URL - Type the HTTP URL in the URL field. The URL should include the domain name and the starting directory or initial Web page.

      • If you enter a directory URL without a start page (for example, www.domain.net/Sports/), the default file name is used when the Web server returns HTML content. If you do not specify a default, index.html is used.

      • HTTP crawling may create files that do not exist on the original server. For example, an HTTP reference to a directory may cause a Web server to respond with HTML content that describes the directory. The HTTP crawler saves this response as index.html.

      • If you enter just a domain name (for example, www.domain.net), the Import wizard will try to find a default page in the document root directory.

      If you click the Advanced button, you can specify a proxy connection in the Advanced Settings dialog box. If you select the Use a proxy server check box, you can select a SOCKS or HTTP proxy and supply the corresponding server and port values.

    • Depth limit while following HTTP links - You can limit how far the import follows links by selecting the appropriate radio button.

      • No limit - This option allows the HTTP import to parse through all pages within the domain.

      • Limit to - This option sets the maximum number of link levels that are crawled. For example, if you choose 1, all Web pages that are one link away (level 1) from the starting page are crawled. If you choose 2, all level 1 pages and the pages linked directly from level 1 pages are imported.

        For example, suppose you specify a crawl depth of 2 and an initial URL of http://host/initialLevel/index.html. If index.html has a reference to http://host/initialLevel/L2/L3/index2.html, then index2.html, which is at level 3, is filtered out and its content is not parsed for further crawling.

  6. Click Next for more options, or Finish to import the Web site.

  7. If you select Next, in the Specify appropriate import options page, select among the choices provided.

    • Convert Links to document relative - If you select this option, links within HTML files are updated in a document-relative fashion, rather than creating absolute links based on their new location in a file system.

    • Overwrite existing resources without warning - If you select this option, existing workbench files in your project are overwritten. If this option is not selected, existing files are not overwritten. You are not prompted to overwrite files selectively.

    • Do not follow links to files in parent folders of the starting URL - If you select this option, the HTTP import does not crawl resources above the initial URL that you provided. For example, if the initial URL is http://host/l1/l2/index.html and a link within the page references http://host/index.html, this option determines whether the linked resource is included in the import. If you leave this option cleared, you risk crawling very large sites and importing huge volumes of files unnecessarily.

    • Connection timeout - This option sets the HTTP connection timeout value, measured in milliseconds. The connection timeout specifies how long the import waits for a response from the server before giving up.

  8. Click Finish to import the Web site with options.

  9. Verify the resulting directory structure and file data integrity in the newly populated project or folder.
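
Several of the behaviors described in the steps above (saving a directory URL as index.html, skipping links to parent folders of the starting URL, and converting links to document-relative form) can be illustrated with a short Python sketch. The function names `local_path`, `outside_start_folder`, and `document_relative`, and the `DEFAULT_PAGE` constant, are hypothetical; the wizard's own logic is not documented here and may differ.

```python
import posixpath
from urllib.parse import urlparse

DEFAULT_PAGE = "index.html"  # assumed default file name for directory URLs


def local_path(url):
    """Map a URL to a project-relative file path.

    A URL that ends in a directory (or has no path) is saved under
    the default file name.
    """
    path = urlparse(url).path or "/"
    if path.endswith("/"):
        path += DEFAULT_PAGE
    return path.lstrip("/")


def outside_start_folder(start_url, link_url):
    """Return True if link_url is not inside the folder of start_url."""
    start_dir = posixpath.dirname(urlparse(start_url).path).rstrip("/") + "/"
    return not urlparse(link_url).path.startswith(start_dir)


def document_relative(page_url, target_url):
    """Express target_url relative to the folder that contains page_url."""
    page_dir = posixpath.dirname(urlparse(page_url).path) or "/"
    return posixpath.relpath(urlparse(target_url).path, start=page_dir)


start = "http://host/l1/l2/index.html"

print(local_path("http://www.domain.net/Sports/"))
print(outside_start_folder(start, "http://host/index.html"))
print(document_relative(start, "http://host/l1/images/logo.gif"))
```

In this sketch, the link to http://host/index.html is reported as lying outside the starting folder, so a crawler that honors the parent-folder option would skip it. Only the path portion of each URL is compared, which assumes both URLs are on the same host.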