Portal, V6.1


 

Web clipping limitations

 

Incorrect or unexpected results on the text clipping page

When creating a Web clipping portlet using the Text clipping option, you may encounter incorrect or unexpected results on the text clipping page that shows the candidate portions of the document to retain. This can happen because "text" clipping identifies the portions of the document to retain by operating on the content of the document at the byte level without interpreting or imposing any structure upon the document content. For this reason, extreme care must be taken when choosing the start and end strings. A detailed explanation of the limitations and dangers of text clipping follows.

Text clipping uses the following process:

  1. The start string, end string, and content are converted to their UTF-8 byte representations.

  2. The byte representation of the content is searched for literal occurrences of the start string byte sequence, followed by the end string byte sequence.

  3. For each occurrence, all bytes between the start and end sequences are extracted and converted back into a UTF-8 string.
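
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the portlet's actual implementation: the function name clip_text and the keep_delimiters flag are hypothetical, and the sketch assumes that matching is case-sensitive and that the search resumes after each end string.

```python
def clip_text(content: str, start: str, end: str, keep_delimiters: bool = True) -> list[str]:
    """Return every slice of content that lies between start and end.

    Hypothetical sketch of byte-level text clipping: the document is
    treated as a flat byte sequence, and HTML tags get no special handling.
    """
    # Step 1: convert the strings and the content to UTF-8 bytes.
    data = content.encode("utf-8")
    start_b = start.encode("utf-8")
    end_b = end.encode("utf-8")
    if not start_b or not end_b:
        return []

    clips = []
    pos = 0
    while True:
        # Step 2: find the start sequence, then the first end sequence after it.
        s = data.find(start_b, pos)
        if s == -1:
            break
        e = data.find(end_b, s + len(start_b))
        if e == -1:
            break
        # Step 3: extract the bytes and convert them back into a string.
        if keep_delimiters:
            chunk = data[s : e + len(end_b)]
        else:
            chunk = data[s + len(start_b) : e]
        clips.append(chunk.decode("utf-8"))
        pos = e + len(end_b)
    return clips
```

Nothing in this loop knows where a tag begins or ends, which is why the choice of start and end strings matters so much.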

This mechanism provides the Web Clipping portlet author with a non-structural approach to document clipping. However, the original document content is always HTML. HTML is inherently a structured content type, and the structure is defined by special byte sequences (tags) that have a particular meaning. Extracting arbitrary slices of that sequence without regard for the special byte sequences is dangerous for the following reasons:

The resulting "clipped" content may alter the semantics of the HTML document and is likely to cause unexpected or incorrect results when the document is rendered by a user-agent. Consider the following example.

Within some HTML document is the following content:

<A HREF="http://www.ibm.com">Go to IBM</A>
<A HREF="http://www.lotus.com">Go To  Lotus </A>

If text clipping is used to clip this document using the start string "ibm", the end string "lotus", and retaining the start and end strings, the following sequence would be clipped:

ibm.com">Go to IBM</A>
<A HREF="http://www.lotus

In this example, the first "<A" has been lost. The effect differs depending on the user-agent that receives the fragment. In all likelihood, the "</A>" will be thrown away and the text before it will be interpreted as text (not structural markup), in which case the ">" before the word "Go" is invalid character data because it is not escaped. The following "<A HREF..." may or may not be closed automatically, but it will almost certainly cause problems in any user-agent.
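
The damage in this example can be made visible with a simple check. The following Python sketch is an illustration only: fragment_problems is a hypothetical helper, not part of Web clipping, and it applies two crude heuristics rather than real HTML parsing.

```python
def fragment_problems(fragment: str) -> list[str]:
    """Report obvious signs that a clipped fragment cuts through a tag."""
    problems = []
    first_lt = fragment.find("<")
    first_gt = fragment.find(">")
    # A ">" that appears before any "<" ends a tag whose beginning was clipped away.
    if first_gt != -1 and (first_lt == -1 or first_gt < first_lt):
        problems.append("stray '>' before the first tag")
    # A "<" after the last ">" begins a tag whose end was clipped away.
    if fragment.rfind("<") > fragment.rfind(">"):
        problems.append("unterminated tag at the end")
    return problems


clipped = 'ibm.com">Go to IBM</A>\n<A HREF="http://www.lotus'
print(fragment_problems(clipped))
```

For the clipped sequence shown above, both heuristics fire: the fragment starts inside one tag and ends inside another.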

 

More on text clipping limitations and restrictions

The Text clipping option enables you to select the content between specific text strings in the HTML document. Content between these strings is kept, and all other content is discarded. However, as with all clipping types (including Keep All Content), the document is prepared for display within a portal before the content you intend to clip is pulled for editing: the HTML and BODY tags are removed, and the HEAD tag and the entire contents of the HEAD section are removed. As a result, the HTML, BODY, and HEAD tags, along with the entire contents of the HEAD section, are not available for use within the starting or ending text strings. For example, specifying a starting text string of </HTML> and an ending text string of </HTML> will yield no matching pieces of text. However, the desired end result can be achieved using either the HTML clipping option or the Keep all content option.

Tip: If you would like to clip an entire page, use either the HTML clipping option or the Keep all content option. To specify clipping options, click Advanced options then click Modify clipping type.
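
To see why a string such as </HTML> can never match, consider the following Python sketch of the preparation step. It is a rough approximation under stated assumptions: prepare_for_portal is a hypothetical name, and regular expressions stand in for whatever HTML processing the portlet really performs.

```python
import re


def prepare_for_portal(html: str) -> str:
    """Hypothetical approximation of the cleanup done before clipping."""
    # Remove the HEAD element together with everything inside it.
    html = re.sub(r"<head\b[^>]*>.*?</head\s*>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Remove the opening and closing HTML and BODY tags, keeping their content.
    html = re.sub(r"</?(?:html|body)\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html


page = "<HTML><HEAD><TITLE>News</TITLE></HEAD><BODY><P>Hello</P></BODY></HTML>"
prepared = prepare_for_portal(page)
print(prepared)               # <P>Hello</P>
print("</HTML>" in prepared)  # False
```

Because text clipping searches the prepared content, a start or end string taken from the removed parts finds nothing.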

 

Portlets created during installation of the Web Clipping WAR file

When the Web Clipping WAR file is installed, two associated portlets appear in the list of available portlets: Web Portlet HTML Template and Web Clipping Editor. Only the Web Clipping Editor portlet can be added to a page.

 

Clipping sites containing JavaScript

The use of JavaScript within Web-based content is widespread. JavaScript is used for two primary reasons:

  1. JavaScript can be used within Web-based content to make the page interactive. This is done through the use of a simple event response paradigm that is well known among user interface developers.

  2. JavaScript can be used to generate dynamic content, that is to generate content "on the fly". This procedure is executed client-side, that is within the user agent. It is executed after the content has been retrieved from the server and returned to the user-agent but before the document is rendered. This can be very useful in generating content that is dependent on the user-agent or client environment. You might say that HTML alone is "environmentally challenged" and JavaScript provides one solution to this problem.

Ideally, a given Web page would act within a Web clipping portlet just like it does in a stand-alone browser. In this respect, Web clipping is a sort of "Portlet Web Browser" and can be considered a unique user-agent that has unique restrictions and characteristics with respect to display and interaction mechanisms, especially with JavaScript.

In versions of WebSphere Portal prior to version 5.0, Web clipping portlets did not have any special functionality to deal with JavaScript. In WebSphere Portal version 5.0, functionality has been added to help enhance Web clipping portlets containing JavaScript as follows:

These enhancements provide support for a large number of pages containing JavaScript. However, some pages might still not function properly. In particular, the following restrictions still apply:

Runtime restrictions

  1. JavaScript that uses relative URLs will be broken, because URLs within JavaScript (relative or not) are not modified during URL rewriting.

  2. JavaScript that depends on a specific hierarchy of a page structure using the DHTML models provided by various browsers may act unexpectedly depending on the situation.

  3. JavaScript that depends on specific browser functionality (for example, Netscape 6 functionality vs. Internet Explorer 6 functionality) may act unexpectedly, or not at all, within other browsers.
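
The first runtime restriction can be illustrated with a simplified rewriter. The Python sketch below is a stand-in built on assumptions, not the portal's rewriting logic: rewrite_attribute_urls and the base URL http://example.com/section/ are hypothetical. It resolves relative URLs in HREF and SRC attributes against the base, but it leaves string literals inside scripts untouched.

```python
import re
from urllib.parse import urljoin

# Matches href="..." and src="..." attributes (double-quoted values only).
ATTR_URL = re.compile(r'''\b(href|src)\s*=\s*"([^"]*)"''', re.IGNORECASE)


def rewrite_attribute_urls(html: str, base_url: str) -> str:
    """Make URLs in HREF/SRC attributes absolute; script bodies are not inspected."""
    def absolutize(match: re.Match) -> str:
        name, url = match.group(1), match.group(2)
        return f'{name}="{urljoin(base_url, url)}"'

    return ATTR_URL.sub(absolutize, html)


page = (
    '<A HREF="news/today.html">Today</A>\n'
    '<SCRIPT>function go() { window.location = "news/today.html"; }</SCRIPT>'
)
rewritten = rewrite_attribute_urls(page, "http://example.com/section/")
print(rewritten)
```

After rewriting, the link points at the original site, while the relative URL in the script is still resolved by the browser against the portal page and is likely to fail.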

HTML Clipping restrictions

  1. All <SCRIPT> blocks defined by the HTML <SCRIPT> element and JavaScript within the <HEAD> element of the HTML document being clipped will be removed.

  2. All JavaScript, including all event handlers and embedded scripts, is removed before the HTML page being clipped is displayed. This means that content generated by scripts is not displayed in the HTML clipping editor and therefore cannot be clipped.

  3. For the same reason, <SCRIPT> blocks that are located within the <BODY> element of the HTML document cannot be individually retained. They may be retained implicitly if an element that contains the <SCRIPT> element within the document hierarchy is selected to be retained. For example, if the <BODY> directly contains a <SCRIPT> element child that generates some content, and the <BODY> element is not selected to be retained, the SCRIPT will be lost. However, if a <TD> element within the document contained a <SCRIPT> element that generated content for the <TD>, and the <TD> element is selected to be retained, the <SCRIPT> would be retained as well (barring the Remove JavaScript security constraint switch).

  4. JavaScript within HTML implicit event handlers (such as onLoad, onMouseOver, and onKeyDown) will only be retained if the element which defines the attribute is retained.

  5. JavaScript embedded within HREF attributes (using the JavaScript: prefix or &{...} syntax) will only be retained if the element which defines the attribute is retained.

Tip: In general, it is not a good idea to use the HTML Clipping type together with pages with JavaScript. Instead, use the Keep All Content clipping type to integrate these types of pages.
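
The rule that a <SCRIPT> block is retained only through a selected ancestor can be modeled in a few lines of Python. This sketch rests on assumptions: retained_scripts and the id attributes are hypothetical, the sample markup is well-formed so that it can be parsed as XML, and the Remove JavaScript security switch is ignored.

```python
import xml.etree.ElementTree as ET


def retained_scripts(root: ET.Element, selected: set[str]) -> list[str]:
    """Return ids of script elements kept when the elements in `selected` are retained."""
    kept = []

    def walk(elem: ET.Element, inside_selection: bool) -> None:
        # A script is kept only if one of its ancestors was selected.
        if elem.tag == "script" and inside_selection:
            kept.append(elem.get("id"))
        inside = inside_selection or elem.get("id") in selected
        for child in elem:
            walk(child, inside)

    walk(root, False)
    return kept


doc = ET.fromstring(
    '<body id="body">'
    '<script id="s1">/* generates content for the body */</script>'
    '<table id="t"><tr><td id="cell">'
    '<script id="s2">/* generates content for the cell */</script>'
    '</td></tr></table>'
    '</body>'
)
print(retained_scripts(doc, {"cell"}))  # ['s2']
print(retained_scripts(doc, {"body"}))  # ['s1', 's2']
```

Selecting only the table cell keeps the script inside it but loses the script that is a direct child of the body, which mirrors the example in restriction 3.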

 

Double-byte character set limitation

If a Web page you are trying to clip does not contain a charset or contains a charset that is not supported by the Web Clipping Editor, then the Web Clipping Editor defaults to the ISO-8859 charset. In this case, double-byte character set characters may not be displayed correctly.
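
The effect of the fallback can be reproduced in Python. The sketch below makes several assumptions that the text above does not confirm: the default is taken to be ISO-8859-1, the set of supported charsets is invented for illustration, and effective_charset is a hypothetical helper.

```python
FALLBACK_CHARSET = "iso-8859-1"  # assumption: the text says only "ISO-8859"
SUPPORTED = {"utf-8", "iso-8859-1", "shift_jis"}  # hypothetical list


def effective_charset(declared, supported=SUPPORTED):
    """Return the declared charset if it is supported, else the fallback."""
    if declared and declared.lower() in supported:
        return declared.lower()
    return FALLBACK_CHARSET


original = "\u65e5\u672c\u8a9e"      # three Japanese characters
raw = original.encode("shift_jis")   # page bytes in a double-byte charset
charset = effective_charset(None)    # the page declares no charset
shown = raw.decode(charset)

print(charset)            # iso-8859-1
print(shown == original)  # False
```

Each double-byte character is decoded as two unrelated single-byte characters, so the text is displayed incorrectly even though no error is raised.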

 

HTML clipping limitations and restrictions

You might notice that at times it is difficult, if not impossible, to clip some Web pages using HTML clipping. In fact, for various technical reasons, there are certain elements within Web content that cannot be clipped using HTML clipping. This section explores some of the well-known limitations and restrictions of HTML clipping.

 

No content appearing in the Web clipping portlet

Due to the limitations of the HTML parser used by the Web Clipping Editor, certain pages with excessively malformed HTML cannot be fixed for proper display. In such cases, unexpected results may occur or no content may appear within the Web clipping portlet.

 

<FRAME> elements

The HTML FRAME support consists of the following:

Note the following restrictions concerning HTML FRAME support:

 

iFrames

The Web Clipping portlet was modified to be able to perform the equivalent function of an older WebPage/IFRAME portlet. As with the older portlets, the following set of restrictions apply when the Web Clipping portlet is configured to display content embedded in an iFrame and is configured to allow the browser/user-agent to access referenced resources directly.

 

Element selection issues

As you work with HTML clipping, you might have difficulties selecting the page elements you want to clip. This is most noticeable, for example, when you want to clip an entire table, but that table does not have a border or any other visual elements that you can use to select it. Instead, you are forced to select all the columns within the table individually and end up with the correct data but the wrong format (not grouped together in the original table).

The only workaround is trial and error: click or select different areas of the rendered output to clip, and then preview the contents to see if you achieved the desired result. This process is made easier by the preview function, which lets you view the results of each selection attempt without having to add the Web clipping portlet to a page and examine its contents.

The HTML clipping tool allows you to select one element and then toggle elements contained within that element, making it appear as though you can keep all the content of a selected item, except for one or two of the elements that it contains. As useful as this can be, you cannot do it using the selection method described above. Instead, if you want to keep the entire contents of a single element except for a few of its sub-elements, you have to individually select just the sub-elements you want to keep. For example, if you want to keep all the contents of a <TABLE> except for the contents of one <TD> element, you cannot select the <TABLE> element and then select the <TD> that you do not want. Instead, select all the <TD>s that you do want. Unfortunately, as mentioned before, because you are forced to select the <TD> elements individually, the data will not be grouped as it was in the original table.
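
The loss of grouping can be seen in a small model of per-element selection. This Python sketch is illustrative only: clip_selected and the id attributes are hypothetical, and the sample table is well-formed so that it can be parsed as XML.

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    '<table><tr>'
    '<td id="a">Name</td><td id="b">Price</td><td id="c">Ad</td>'
    '</tr></table>'
)


def clip_selected(root: ET.Element, selected_ids: set[str]) -> list[str]:
    """Serialize each selected element on its own, in document order."""
    return [
        ET.tostring(elem, encoding="unicode")
        for elem in root.iter()
        if elem.get("id") in selected_ids
    ]


clipped_cells = clip_selected(table, {"a", "b"})
print(clipped_cells)
```

The unwanted cell is gone, but the two retained cells come back as separate fragments without the enclosing <TABLE> and <TR> elements.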

 
