Reference: Web clipping limitations

 


 

  1. Incorrect or unexpected results on the text clipping page
  2. More on text clipping limitations and restrictions
  3. Portlets created during installation of the Web Clipping WAR file
  4. Clipping sites containing JavaScript
  5. Double-byte character set limitation
  6. HTML clipping limitations and restrictions
  7. No content appearing in the Web clipping portlet
  8. <FRAME> elements
  9. Element selection issues
  10. iFrames
  11. Browser limitations

 

Incorrect or unexpected results on the text clipping page

When creating a Web clipping portlet using the Text clipping option, you may encounter incorrect or unexpected results on the text clipping page that shows the candidate portions of the document to retain. This can happen because "text" clipping identifies the portions of the document to retain by operating on the content of the document at the byte level without interpreting or imposing any structure upon the document content. For this reason, care must be taken when choosing the start and end strings.

Text clipping uses the following process:

  1. The start string, end string, and content are converted to their UTF-8 byte representations.

  2. The byte representation of the content is searched for literal occurrences of the start string byte sequence, followed by the end string byte sequence.

  3. For each occurrence, all bytes between the start and end sequences are extracted and converted back into a UTF-8 string.
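
As a rough illustration, the three steps above can be sketched in Python. The function name clip_text and the keep_delimiters flag are hypothetical; this is not the Web Clipping Editor's actual code, only a minimal model of a byte-level search that ignores document structure.

```python
from typing import List


def clip_text(content: str, start: str, end: str,
              keep_delimiters: bool = True) -> List[str]:
    """Return every slice of content found between start and end (hypothetical helper)."""
    # Step 1: convert the strings and the content to UTF-8 bytes.
    data = content.encode("utf-8")
    start_b = start.encode("utf-8")
    end_b = end.encode("utf-8")
    if not start_b or not end_b:
        return []

    clips = []
    pos = 0
    while True:
        # Step 2: find a literal start sequence followed by an end sequence.
        s = data.find(start_b, pos)
        if s == -1:
            break
        e = data.find(end_b, s + len(start_b))
        if e == -1:
            break
        # Step 3: extract the bytes and decode them back into a string.
        if keep_delimiters:
            chunk = data[s:e + len(end_b)]
        else:
            chunk = data[s + len(start_b):e]
        clips.append(chunk.decode("utf-8", errors="replace"))
        pos = e + len(end_b)
    return clips
```

Because the search never looks at tags, nothing in this model stops a start or end string from matching inside a tag name or an attribute value.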

This mechanism provides the Web Clipping portlet author with a non-structural approach to document clipping. However, the original document content is always HTML. HTML is inherently a structured content type, and the structure is defined by special byte sequences (tags) that have a particular meaning. Extracting arbitrary slices of that sequence without regard for the special byte sequences is dangerous for the following reasons:

  • Any given byte sequence may begin or end within the middle of one of the special byte sequences that define document structure.

  • The HTML document structure is hierarchical; that is, certain document structures depend on parent structures to define their interpretation. Extracting arbitrary sequences is therefore a lossy process: taking a child structure out of the context of its parent structure can cause a loss of meaning.

The resulting "clipped" content may alter the semantics of the HTML document and is likely to cause unexpected or incorrect results when the document is rendered by a user-agent. Consider the following example.

Within some HTML document is the following content:

<A HREF="http://www.ibm.com">Go to IBM </A>
<A HREF="http://www.lotus.com">Go To   Lotus  </A>

If text clipping is used to clip this document using the start string "ibm", the end string "lotus", and retaining the start and end strings, the following sequence would be clipped:

ibm.com">Go to IBM </A>
<A HREF="http://www.lotus 

In this example, the first "<A" has been lost. The effect differs depending on the user-agent that receives the content. In all likelihood, the "</A>" will be thrown away and the text before it will be interpreted as text (not structural markup), in which case the ">" before the word "Go" is invalid character data because it is not escaped. The trailing "<A HREF..." may or may not be closed automatically, but it will almost certainly cause problems in any user-agent.

 

More on text clipping limitations and restrictions

The Text clipping option lets you select the content between specific text strings in the HTML document. Content between these strings is kept, and all other content is discarded. However, as with all clipping types (including Keep All Content), the document is prepared for display within a portal before the content you intend to clip is pulled for editing: the HTML and BODY tags are removed, and the HEAD tag and the entire contents of the HEAD section are removed. For text clipping, this means that the HTML, BODY, and HEAD tags, along with the entire contents of the HEAD section, are not available for use within the starting or ending text strings. For example, specifying a starting text string of <HTML> and an ending text string of </HTML> yields no matching pieces of text. However, the desired end result can be achieved using either the HTML clipping option or the Keep all content option.
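
The preparation step described above can be approximated with two regular expressions. This is only a sketch: the function name prepare_for_portal is hypothetical, and the product presumably relies on its HTML parser rather than on pattern matching.

```python
import re


def prepare_for_portal(html: str) -> str:
    """Approximate the pre-clipping cleanup (hypothetical helper)."""
    # Drop the HEAD element together with everything inside it.
    html = re.sub(r"<head\b[^>]*>.*?</head\s*>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop the HTML and BODY start and end tags, keeping what they enclose.
    html = re.sub(r"</?(?:html|body)\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html


page = "<HTML><HEAD><TITLE>News</TITLE></HEAD><BODY><P>Hello</P></BODY></HTML>"
prepared = prepare_for_portal(page)
print(prepared)               # <P>Hello</P>
print("</HTML>" in prepared)  # False
```

Once the document has been through a step like this, a start or end string that names one of the removed tags has nothing left to match.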

Tip: If you would like to clip an entire page, use either the HTML clipping option or the Keep all content option. To specify clipping options, click Advanced options, then click Modify clipping type.

 

Portlets created during installation of the Web Clipping WAR file

When the Web Clipping WAR file is installed, two associated portlets appear in the list of available portlets: Web Portlet HTML Template and Web Clipping Editor. Only the Web Clipping Editor portlet can be added to a page.

  • Web Portlet HTML Template is used as a template for new portlets created by the Web Clipping Editor and cannot be added to a page. Adding the Web Portlet HTML Template to a page will result in an error.

  • Web Clipping Editor is the GUI for creating and editing Web clipping portlets and can be successfully added to a page.

 

Clipping sites containing JavaScript

JavaScript is used to:

  1. Make pages interactive through the use of an event response paradigm.

  2. Generate dynamic content.

    JavaScript is executed after the content has been retrieved from the server and returned to the user-agent but before the document is rendered. This can be useful in generating content that is dependent on the user-agent or client environment.

Ideally, a given Web page would act within a Web clipping portlet just like it does in a stand-alone browser. In this respect, Web clipping is a sort of "Portlet Web Browser" and can be considered a unique user-agent that has unique restrictions and characteristics with respect to display and interaction mechanisms, especially with JavaScript.

In versions of WebSphere Portal prior to v5.0, Web clipping portlets did not have any special functionality to deal with JavaScript. In WebSphere Portal v5.0, functionality has been added to help enhance Web clipping portlets containing JavaScript as follows:

  • All JavaScript on the source Web page will be retained, unless otherwise specified using the Remove all JavaScript security option.

  • JavaScript within the HEAD of a document will be relocated to the BODY of the document prior to any other children of the BODY.

    No other modifications to JavaScript will be made automatically.
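
The relocation of HEAD scripts can be pictured with the following sketch. It uses regular expressions for brevity and assumes a page with one HEAD and one BODY; the function name relocate_head_scripts is hypothetical and does not come from the product.

```python
import re

SCRIPT = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
HEAD = re.compile(r"(<head\b[^>]*>)(.*?)(</head\s*>)", re.IGNORECASE | re.DOTALL)
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def relocate_head_scripts(html: str) -> str:
    """Move SCRIPT blocks from the HEAD to the very start of the BODY."""
    head = HEAD.search(html)
    if head is None or BODY_OPEN.search(html) is None:
        return html
    scripts = SCRIPT.findall(head.group(2))
    if not scripts:
        return html
    # Rebuild the HEAD without its scripts.
    new_head = head.group(1) + SCRIPT.sub("", head.group(2)) + head.group(3)
    html = html[:head.start()] + new_head + html[head.end():]
    # Insert the scripts before any other child of the BODY.
    body = BODY_OPEN.search(html)
    return html[:body.end()] + "".join(scripts) + html[body.end():]
```

In this model the scripts keep their original order and their text is left unchanged, in line with the statement that no other modifications are made automatically.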

These enhancements provide support for a large number of pages containing JavaScript. However, some pages might still not function properly. In particular, the following restrictions will still apply:

Runtime restrictions

  1. JavaScript that uses relative URLs will break, because URLs within JavaScript (relative or not) are not modified during URL rewriting.

  2. JavaScript that depends on a specific hierarchy of a page structure using the DHTML models provided by various browsers may act unexpectedly depending on the situation.

  3. JavaScript that depends on specific browser functionality (for example, Netscape 6 functionality vs. Internet Explorer 6 functionality) may not be viewable within other browsers, may act unexpectedly, or may not run at all.
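
The first restriction can be illustrated with a simplified model of URL rewriting. The sketch below assumes that rewriting only resolves href and src attribute values against the page URL; the real portlet rewrites URLs in its own way, and the function name and the example.com address are hypothetical.

```python
import re
from urllib.parse import urljoin

SCRIPT = re.compile(r"(<script\b[^>]*>.*?</script\s*>)", re.IGNORECASE | re.DOTALL)
ATTR = re.compile(r'(\b(?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_attribute_urls(html: str, base_url: str) -> str:
    """Resolve href/src attribute values; leave SCRIPT blocks untouched."""
    parts = SCRIPT.split(html)  # odd indexes hold the SCRIPT blocks
    for i in range(0, len(parts), 2):
        parts[i] = ATTR.sub(
            lambda m: m.group(1) + urljoin(base_url, m.group(2)) + m.group(3),
            parts[i],
        )
    return "".join(parts)


page = ('<a href="news/today.html">News</a>'
        '<script>location.href = "news/today.html";</script>')
rewritten = rewrite_attribute_urls(page, "http://www.example.com/site/index.html")
```

In rewritten, the link points to http://www.example.com/site/news/today.html, while the same relative URL inside the script is unchanged and would now be resolved against the portal page instead of the original site.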

HTML Clipping restrictions

  1. All <SCRIPT> blocks defined by the HTML <SCRIPT> element and JavaScript within the <HEAD> element of the HTML document being clipped will be removed.

  2. All JavaScript, including all event handlers and embedded scripts, is removed before the HTML page being clipped is displayed. This means that for those scripts that generate content, the content will not be displayed in the HTML clipping editor and therefore cannot be clipped.

  3. For the same reason, <SCRIPT> blocks that are located within the <BODY> element of the HTML document cannot be individually retained. They may be retained implicitly if an element that contains the <SCRIPT> element within the document hierarchy is selected to be retained. For example, if the <BODY> directly contains a <SCRIPT> element child that generates some content, and the <BODY> element is not selected to be retained, the <SCRIPT> will be lost. However, if a <TD> element within the document contains a <SCRIPT> element that generates content for the <TD>, and the <TD> element is selected to be retained, the <SCRIPT> will be retained as well (unless the Remove all JavaScript security option is set).

  4. JavaScript within HTML implicit event handlers (such as onLoad, onMouseOver, and onKeyDown) will only be retained if the element which defines the attribute is retained.

  5. JavaScript embedded within HREF attributes (using the JavaScript: prefix or &{...} syntax) will only be retained if the element which defines the attribute is retained.
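
Restrictions 3 through 5 share one rule: a script survives only as part of an element that is retained. The sketch below models that rule with Python's xml.etree.ElementTree on a small, well-formed fragment. The id attributes, the script function names, and the clip_elements helper are all hypothetical; the real editor selects elements interactively.

```python
import xml.etree.ElementTree as ET

PAGE = (
    "<body>"
    "<script>writeBanner()</script>"
    "<table><tr>"
    '<td id="news" onmouseover="highlight(this)">'
    "<script>writeNews()</script>News</td>"
    '<td id="ads">Ads</td>'
    "</tr></table>"
    "</body>"
)


def clip_elements(markup, keep_ids):
    """Return the serialized elements whose id is in keep_ids."""
    root = ET.fromstring(markup)
    return [
        ET.tostring(element, encoding="unicode")
        for element in root.iter()
        if element.get("id") in keep_ids
    ]


# Retaining only the "news" cell keeps its child script and its event
# handler, but the script that sits directly under <body> is lost.
clipped = clip_elements(PAGE, {"news"})
```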

Tip: In general, it is not a good idea to use the HTML Clipping type together with pages with JavaScript. Instead, use the Keep All Content clipping type to integrate these types of pages.

 

Double-byte character set limitation

If a Web page you are trying to clip does not contain a charset or contains a charset that is not supported by the Web Clipping Editor, then the Web Clipping Editor defaults to the ISO-8859 charset. In this case, double-byte character set characters may not be displayed correctly.
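
The effect of the fallback can be reproduced with a short sketch. It assumes the default is ISO-8859-1 specifically and uses a made-up list of supported charsets; the decode_page helper is hypothetical.

```python
SUPPORTED = {"utf-8", "iso-8859-1", "shift_jis"}  # hypothetical list


def decode_page(raw, declared=None, supported=SUPPORTED):
    """Decode page bytes, falling back to ISO-8859-1 (assumed default)."""
    if declared and declared.lower() in supported:
        charset = declared.lower()
    else:
        charset = "iso-8859-1"
    return raw.decode(charset, errors="replace")


raw = "日本語".encode("shift_jis")       # three double-byte characters
correct = decode_page(raw, "Shift_JIS")  # the original three characters
garbled = decode_page(raw)               # six unrelated single-byte characters
```

With the fallback, every byte of a double-byte character is decoded as a separate single-byte character, which is why the text appears garbled.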

 

HTML clipping limitations and restrictions

You might notice that at times it is difficult, if not impossible, to clip some Web pages using HTML clipping. In fact, for various technical reasons, there are certain elements within Web content that cannot be clipped using HTML clipping. This section explores some of the well-known limitations and restrictions of HTML clipping.

 

No content appearing in the Web clipping portlet

Due to the limitations of the HTML parser used by the Web Clipping Editor, certain pages with excessively malformed HTML cannot be fixed for proper display. In such cases, unexpected results may occur or no content may appear within the Web clipping portlet.

 

<FRAME> elements

The HTML FRAME support consists of the following:

  • Enablement of all existing HTML-based Web clipping portlets to navigate to pages containing FRAME or FRAMESET elements. That is, if you have any existing Web clipping portlets and somewhere within the content of those portlets is a link to a page that contains FRAMESET elements, the link can now be traversed and the content displayed and navigated.

  • Creation of new HTML-based Web clipping portlets against pages that contain FRAMESET or FRAME elements using "Keep All Content" mode only.

  • Both inter-FRAME and intra-FRAME navigation are supported on pages with FRAMEs, just as in a desktop browser.

Note the following restrictions concerning HTML FRAME support:

  • FRAME tags that include the onload and onunload attributes for executing JavaScript functions are not preserved when converting the FRAME to a table cell. There is no support for those attributes on table cell (<TD>) elements.

  • The "Keep All Content" mode is required when creating new Web clipping portlets that directly reference pages containing FRAMEs; the "HTML Clipping" mode cannot be used for such pages. You can still create Web clipping portlets that indirectly reference pages containing FRAMEs or FRAMESETs through a link; however, you cannot clip those pages as you can pages that do not contain FRAMEs. As a result of this restriction, FRAME navigation is not supported in the editor (on the Finish page or HTML clipping page).

  • Links in the created portlets cannot be followed if those pages contain embedded FRAMEs. For new portlets that contain FRAMEs and new or existing portlets that indirectly reference pages with FRAMEs, the links in that portlet can be navigated as usual. However, indeterminate results will occur if the links also contain FRAMEs, that is, if they reference a page that contains new FRAMESET or FRAME elements.

  • FRAME support is not provided for non-HTML user agents. Currently, the successful presentation and navigation of pages containing FRAMEs will work only for portlets that are targeted to be used with HTML-based user agents that support HTML conforming to the HTML 3.2 specification and above. Web clipping portlets that encounter pages with FRAMESET or FRAME elements cannot be viewed from mobile devices or non-HTML devices.

 

Element selection issues

As you work with HTML clipping, you might have difficulty selecting the page elements you want to clip. This is most noticeable, for example, when you want to clip an entire table, but that table does not have a border or any other visual elements that you can use to select it. Instead, you are forced to select all the columns within the table individually and end up with the correct data but the wrong format (not grouped together as in the original table).

The only workaround is trial and error: click or select different areas of the rendered output to clip, and then preview the contents to see whether you achieved the desired result. This process is made easier by the preview function, which lets you view the results of each selection attempt without having to add the Web clipping portlet to a page and examine its contents.

The HTML clipping tool lets you select one element and then toggle elements contained within that element, which makes it appear as though you can keep all the content of a selected item except for one or two of the elements that it contains. As useful as this would be, it does not work with the selection method described above. Instead, to keep the entire contents of a single element except for a few of its sub-elements, you have to individually select just the sub-elements you want to keep. For example, to keep all the contents of a <TABLE> except for the contents of one <TD> element, you cannot select the <TABLE> element and then deselect the <TD> that you do not want. Instead, select all the <TD>s that you do want. Unfortunately, as mentioned before, because you are forced to select the <TD> elements individually, the data will not be grouped as it was in the original table.

 

iFrames

The Web Clipping portlet was modified to be able to perform the equivalent function of an older WebPage/IFRAME portlet. As with the older portlets, the following set of restrictions apply when the Web Clipping portlet is configured to display content embedded in an iFrame and is configured to allow the browser/user-agent to access referenced resources directly.

  • The portlet will not authenticate when viewed from Internet Explorer with the MS04-004 Cumulative Security Update applied. The MS04-004 update disables the ability to use URLs of the format http://username:password@www.acme.com/login.jsp. This format was disabled for security purposes because the username and password appear in cleartext within the URL and can be compromised.

  • The portal server and the server hosting the site for which authentication is required must be within the same top-level domain (for example, acme.com). Due to security restrictions, browsers/user-agents will not accept cookies for server a.b.c if they are sent by server x.y.z or any other server outside the domain b.c. Doing so would allow potential spoof attacks to gain access to authenticated content on server a.b.c without its explicit consent. For this reason, the portal server on which the portlet resides and the server(s) hosting the site which requires authentication must be within the same top-level domain.

  • User-agents must access the portlet using a fully qualified domain name when the portlet uses FORM-based authentication. When end users access the portlet from a browser, they must use the fully qualified domain name in the address bar; for example, http://portalone.acme.com/wps/portal and not http://portalone/wps/portal. The latter form will not work because browser security restrictions reject cookies from a domain other than that of the server on which the response originates.
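
The restrictions above can be made concrete with a small sketch. The first helper shows that credentials embedded in a URL are plainly readable. The second is a deliberately simplified stand-in for the browser's cookie domain check: it only compares the last two host name labels, whereas real browsers apply more detailed matching rules. Both function names are hypothetical.

```python
from urllib.parse import urlsplit


def embedded_credentials(url):
    """Return the (username, password) carried in a URL, if any."""
    parts = urlsplit(url)
    return parts.username, parts.password


def share_parent_domain(host_a, host_b, labels=2):
    """Simplified check: do both hosts end with the same domain labels?"""
    a = host_a.lower().split(".")
    b = host_b.lower().split(".")
    if len(a) < labels or len(b) < labels:
        return False  # an unqualified name such as "portalone" never matches
    return a[-labels:] == b[-labels:]


print(embedded_credentials("http://username:password@www.acme.com/login.jsp"))
# ('username', 'password')
print(share_parent_domain("portalone.acme.com", "www.acme.com"))  # True
print(share_parent_domain("portalone", "www.acme.com"))           # False
```

The last call shows why the short host name fails in this model: without the domain part there is nothing for the check to match against the authenticating server's domain.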

 

Browser limitations

Netscape Communicator and Navigator 4.7 will occasionally prematurely disconnect from an outstanding request to a Web clipping portlet, causing an empty document to be returned instead of the appropriate content. Later versions of the Netscape browsers do not exhibit this problem.

Internet Explorer 5.0 sometimes causes content from one portlet to "bleed" into the content of another portlet on the same portal page, meaning text from one portlet appears to overlay text from another. Resizing the browser window corrects the problem. Later versions of Internet Explorer do not exhibit this behavior.

 

 

Parent topic:

Use Web clipping to import content