
Web clipping limitations


View the limitations of Web clipping along with detailed explanations.


Incorrect or unexpected results on the text clipping page

When creating a Web clipping portlet using the Text clipping option, you may encounter incorrect or unexpected results on the text clipping page that shows the candidate portions of the document to retain. This can happen because "text" clipping identifies the portions of the document to retain by operating on the content of the document at the byte level without interpreting or imposing any structure upon the document content. For this reason, extreme care must be taken when choosing the start and end strings. A detailed explanation of the limitations and dangers of text clipping follows.

Text clipping uses the following process:

  1. The start string, end string, and content are converted to their UTF-8 byte representations.

  2. The byte representation of the content is searched for literal occurrences of the start string byte sequence, followed by the end string byte sequence.

  3. For each occurrence, all bytes between the start and end sequences are extracted and converted back into a UTF-8 string.
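The three steps above can be sketched in a few lines of Python. This is only an illustration of the byte-level idea, not the portal's implementation: the function name `text_clip`, the `keep_delimiters` flag, and the left-to-right, non-overlapping scan are all assumptions made for the example.

```python
def text_clip(content, start, end, keep_delimiters=True):
    """Return every slice of content lying between start and end,
    matched as literal UTF-8 byte sequences (hypothetical helper)."""
    # Step 1: convert everything to UTF-8 bytes.
    data = content.encode("utf-8")
    start_b = start.encode("utf-8")
    end_b = end.encode("utf-8")

    clips = []
    pos = 0
    while True:
        # Step 2: find the start sequence, then the next end sequence after it.
        s = data.find(start_b, pos)
        if s == -1:
            break
        e = data.find(end_b, s + len(start_b))
        if e == -1:
            break
        # Step 3: extract the bytes and convert them back to a string.
        if keep_delimiters:
            piece = data[s:e + len(end_b)]
        else:
            piece = data[s + len(start_b):e]
        clips.append(piece.decode("utf-8"))
        pos = e + len(end_b)  # assumption: continue scanning after this match
    return clips


page = "<P>alpha</P><P>beta</P><P>gamma</P>"
print(text_clip(page, "alpha", "gamma"))
print(text_clip(page, "alpha", "gamma", keep_delimiters=False))
```

Nothing in the sketch knows what a tag is, so both results begin or end in the middle of the markup that surrounded the matched strings.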

This mechanism provides the Web Clipping portlet author with a non-structural approach to document clipping. However, the original document content is always HTML. HTML is inherently a structured content type, and the structure is defined by special byte sequences (tags) that have a particular meaning. Extracting arbitrary slices of that content without regard for the special byte sequences is dangerous for the following reason:

The resulting "clipped" content may alter the semantics of the HTML document and is highly likely to cause unexpected or incorrect results when the document is rendered by a user-agent. Consider the following example.

Within some HTML document is the following content:

<A HREF="http://www.ibm.com">Go to IBM</A>
<A HREF="http://www.lotus.com">Go To Lotus</A>

If text clipping is used to clip this document using the start string "ibm", the end string "lotus", and retaining the start and end strings, the following sequence would be clipped:

ibm.com">Go to IBM</A>
<A HREF="http://www.lotus

In this example, the first "<A" has been lost. The effect differs depending on the user-agent that receives the content. In all likelihood, the "</A>" will be thrown away and the text before it will be interpreted as text (not structural markup), in which case the ">" before the word "Go" is invalid character data because it is not escaped. The following "<A HREF..." may or may not be closed automatically, but it will almost certainly cause problems in any user-agent.
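To see how a tolerant parser views the clipped fragment, the sketch below feeds it to Python's standard `html.parser` module. This parser is used purely for illustration and is an assumption of the example; it is not the parser used by the portal or by any particular browser, and other user-agents may recover differently. The `TagLogger` class is a hypothetical helper.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records the start tags, end tags, and text the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


# The fragment produced by clipping from "ibm" to "lotus".
clipped = 'ibm.com">Go to IBM</A>\n<A HREF="http://www.lotus'

logger = TagLogger()
logger.feed(clipped)
for event in logger.events:
    print(event)
```

The parser reports the remains of the first link as plain text, followed by an end tag for an element that was never opened. The trailing `<A HREF=...` never completes, so `feed()` reports no start tag for it.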


More on text clipping limitations and restrictions

The Text clipping option enables you to select the content between specific text strings in the HTML document. Content between these strings is kept, and all other content is discarded. However, as with all clipping types (including Keep All Content), before the content you intend to clip is pulled for editing, the HTML and BODY tags in the original HTML document are removed, and the HEAD tag and the entire contents of the HEAD section are removed to prepare the document for display within a portal. The implication of this concerning text clipping is that the HTML, BODY, and HEAD tags, along with the entire contents of the HEAD section, will not be available for use within the starting or ending text strings used to perform text clipping.

For example, specifying a starting text string of </HTML> and an ending text string of </HTML> will yield no matching pieces of text. However, the required end result can be easily achieved using either the HTML clipping option or the Keep all content option.

Tip: To clip an entire page, use either the HTML clipping option or the Keep all content option. To specify clipping options, click Advanced options, then click Modify clipping type.


Portlets created during installation of the Web Clipping WAR file

When the Web Clipping WAR file is installed, two associated portlets appear in the list of available portlets: Web Portlet HTML Template and Web Clipping Editor. Only the Web Clipping Editor portlet can be added to a page.


Clipping sites containing JavaScript

The use of JavaScript within Web-based content is widespread. JavaScript is used for two primary reasons:

  1. JavaScript can be used within Web-based content to make the page interactive. This is done through the use of a simple event response paradigm that is well known among user interface developers.

  2. JavaScript can be used to generate dynamic content, that is, to generate content "on the fly". This procedure is executed client-side, that is, within the user-agent. It is executed after the content has been retrieved from the server and returned to the user-agent but before the document is rendered. This can be useful in generating content that is dependent on the user-agent or client environment. You might say that HTML alone is "environmentally challenged" and JavaScript provides one solution to this problem.

Ideally, a given Web page would act within a Web clipping portlet just like it does in a stand-alone browser. In this respect, Web clipping is a sort of "Portlet Web Browser" and can be considered a unique user-agent that has unique restrictions and characteristics with respect to display and interaction mechanisms, especially with JavaScript.

In versions of WebSphere Portal prior to version 5.0, Web clipping portlets did not have any special functionality to deal with JavaScript. In Version 5.0 of the portal, functionality has been added to help enhance Web clipping portlets containing JavaScript as follows:

These enhancements provide support for a large number of pages containing JavaScript. However, some pages might still not function properly. In particular, the following restrictions still apply:

Runtime restrictions:

  1. JavaScript that uses relative URLs will break, because URLs within JavaScript (relative or not) are not modified during URL rewriting.

  2. JavaScript that depends on a specific hierarchy of a page structure using the DHTML models provided by various browsers may act unexpectedly depending on the situation.

  3. JavaScript that depends on specific browser functionality may not be viewable within other browsers (for example, Netscape 6 functionality vs. Internet Explorer 6 functionality), may act unexpectedly, or not at all.
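The first runtime restriction can be illustrated with a toy rewriter. The sketch below is an assumption-laden model, not the portal's rewriting logic: it resolves `HREF` and `SRC` attribute values against a hypothetical base URL and, like the behavior described above, never looks inside script code. The base URL, the sample page, and the function name `rewrite_attribute_urls` are all invented for the example.

```python
import re
from urllib.parse import urljoin

# Hypothetical address of the page being clipped.
BASE = "http://www.example.com/news/index.html"

# Matches double-quoted HREF or SRC attribute values only.
ATTR_URL = re.compile(r'(\b(?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_attribute_urls(html, base):
    """Resolve HREF/SRC attribute values against base; leave all other text alone."""
    return ATTR_URL.sub(
        lambda m: m.group(1) + urljoin(base, m.group(2)) + m.group(3),
        html,
    )


page = (
    '<A HREF="today.html">Today</A>\n'
    '<SCRIPT>function go() { window.location = "archive/2003.html"; }</SCRIPT>'
)

rewritten = rewrite_attribute_urls(page, BASE)
print(rewritten)
```

The link in the `HREF` attribute now points back at the source site, but the relative URL inside the script is unchanged. When the script runs inside the portal page, that URL is resolved against the portal's address rather than the original site's.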

HTML Clipping restrictions:

  1. All <SCRIPT> blocks defined by the HTML <SCRIPT> element and JavaScript within the <HEAD> element of the HTML document being clipped will be removed.

  2. All JavaScripts, including all event handlers and embedded JavaScripts, are removed prior to displaying the HTML page being clipped. This means that for those scripts that generate content, the content will not be displayed in the HTML clipping editor and therefore cannot be clipped.

  3. For the same reason, <SCRIPT> blocks that are located within the <BODY> element of the HTML document cannot be individually retained. They may be retained implicitly if an element containing the <SCRIPT> element within the document hierarchy is selected to be retained.

    For example, if the <BODY> directly contains a <SCRIPT> element child that generates some content, and the <BODY> element is not selected to be retained, the SCRIPT will be lost. However, if a <TD> element within the document contained a <SCRIPT> element that generated content for the <TD>, and the <TD> element is selected to be retained, the <SCRIPT> would be retained as well (barring the Remove JavaScript security constraint switch).

  4. JavaScript within HTML implicit event handlers (such as onLoad, onMouseOver, and onKeyDown) will only be retained if the element which defines the attribute is retained.

  5. JavaScript embedded within HREF attributes (using the JavaScript: prefix or &{...} syntax) will only be retained if the element which defines the attribute is retained.

In general, it is not a good idea to use the HTML Clipping type together with pages with JavaScript. Instead, use the Keep All Content clipping type to integrate these types of pages.


Double-byte character set limitation

If the Web page you are trying to clip does not specify a charset, or specifies a charset that is not supported by the Web Clipping Editor, the Web Clipping Editor defaults to the ISO-8859 charset. In this case, double-byte character set characters may not be displayed correctly.
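The effect of this fallback can be reproduced with a short sketch. It assumes that the default is ISO-8859-1 (the text above says only "ISO-8859") and uses a made-up list of supported charsets; the function `decode_with_fallback` is hypothetical and stands in for whatever the editor does internally.

```python
# Three Japanese characters, written as escapes so the example is encoding-safe.
text = "\u65e5\u672c\u8a9e"

# The page as served: double-byte Shift_JIS, two bytes per character.
raw = text.encode("shift_jis")


def decode_with_fallback(raw_bytes, declared,
                         supported=("utf-8", "shift_jis"),
                         fallback="iso-8859-1"):
    """Decode with the declared charset if supported, else with the fallback.
    The supported list and the fallback name are assumptions for this sketch."""
    charset = declared if declared in supported else fallback
    return raw_bytes.decode(charset)


ok = decode_with_fallback(raw, "shift_jis")   # charset declared and supported
garbled = decode_with_fallback(raw, None)     # no charset declared

print(ok == text)
print(len(text), len(garbled))
```

With no usable charset, each byte is decoded as its own single-byte character, so the three original characters come back as six unrelated ones.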


HTML clipping limitations and restrictions

You might notice that at times it is difficult, if not impossible, to clip some Web pages using HTML clipping. In fact, for various technical reasons, there are certain elements within Web content that cannot be clipped using HTML clipping. This section explores some of the well-known limitations and restrictions of HTML clipping.


No content appearing in the Web clipping portlet

Due to the limitations of the HTML parser used by the Web Clipping Editor, certain pages with excessively malformed HTML cannot be fixed for proper display. In such cases, unexpected results may occur or no content may appear within the Web clipping portlet.


<FRAME> elements

The HTML FRAME support consists of the following:

Note the following restrictions concerning HTML FRAME support:


iFrames

The Web Clipping portlet was modified to be able to perform the equivalent function of an older WebPage/IFRAME portlet. As with the older portlets, the following set of restrictions apply when the Web Clipping portlet is configured to display content embedded in an iFrame and is configured to allow the browser/user-agent to access referenced resources directly.


Element selection issues

As you work with HTML clipping, you might have difficulties selecting the page elements to clip. This is most noticeable, for example, when you want to clip an entire table, but that table does not have a border or any other visual elements that can be used to select it. Instead, you are forced to select all the columns within the table individually and end up with the correct data but the wrong format (not grouped together as in the original table).

The only workaround is trial and error: click or select different areas of the rendered output to clip, and then preview the contents to see whether you achieved the required result. This process is made easier by the preview function, which lets you view the results of each selection attempt without having to add the Web clipping portlet to a page and examine its contents.

The HTML clipping tool allows you to select one element and then toggle elements contained within that element, making it appear as though you can keep all the content of a selected item except for one or two of the elements that it contains. As useful as this would be, you cannot do it using the selection method described previously. To keep the entire contents of a single element except for a few of its sub-elements, you have to individually select just the sub-elements to keep.

For example, to keep all the contents of a <TABLE> except for the contents of one <TD> element, you cannot select the <TABLE> element and then select the <TD> that you do not want. Instead, you must select all the <TD> elements that you do want. Unfortunately, as mentioned before, because you are forced to select the <TD> elements individually, the data will not be grouped as it was in the original table.
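The loss of grouping can be modeled with a small tree example. This is a simplified illustration under stated assumptions, not the clipping tool's algorithm: it treats a selection as "keep exactly the chosen elements", uses a well-formed snippet so that Python's `xml.etree.ElementTree` can parse it, and identifies cells with `id` attributes that are invented for the example.

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in for the table in the page being clipped.
table = ET.fromstring(
    "<TABLE><TR>"
    "<TD id='a'>A</TD><TD id='b'>B</TD><TD id='c'>C</TD>"
    "</TR></TABLE>"
)

# Hypothetical selection: every cell except the unwanted one ('b').
selected_ids = {"a", "c"}

# Keep only the selected elements themselves, as separate fragments.
kept = [
    ET.tostring(td, encoding="unicode")
    for td in table.iter("TD")
    if td.get("id") in selected_ids
]

result = "".join(kept)
print(result)
```

The wanted cells survive, but the enclosing TABLE and TR elements are not part of the result, which is why the data is no longer laid out as it was in the original table.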

