

Web clipping limitations


Incorrect or unexpected results on the text clipping page

When creating a Web clipping portlet using the Text clipping option, you may encounter incorrect or unexpected results on the text clipping page that shows the candidate portions of the document to retain. This can happen because "text" clipping identifies the portions of the document to retain by operating on the content of the document at the byte level without interpreting or imposing any structure upon the document content. For this reason, extreme care must be taken when choosing the start and end strings. A detailed explanation of the limitations and dangers of text clipping follows.

Text clipping uses the following process:

  1. The start string, end string, and content are converted to their UTF-8 byte representations.

  2. The byte representation of the content is searched for literal occurrences of the start string byte sequence, followed by the end string byte sequence.

  3. For each occurrence, all bytes between the start and end sequences are extracted and decoded from UTF-8 back into a string.
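
The three steps above can be sketched in a few lines of Python. This is only an illustration of the byte-level approach, not the portlet's actual code: the function name text_clip and the keep_markers parameter are hypothetical, and the sample page is made up.

```python
def text_clip(content: str, start: str, end: str, keep_markers: bool = True) -> list[str]:
    """Return every slice of `content` that lies between `start` and `end`.

    Hypothetical helper: it mirrors the documented steps, not the real portlet code.
    """
    # Step 1: convert everything to UTF-8 bytes.
    data = content.encode("utf-8")
    start_b = start.encode("utf-8")
    end_b = end.encode("utf-8")

    clips = []
    pos = 0
    while True:
        # Step 2: literal, case-sensitive search for start, then for end after it.
        s = data.find(start_b, pos)
        if s == -1:
            break
        e = data.find(end_b, s + len(start_b))
        if e == -1:
            break
        # Step 3: extract the bytes and decode them back into a string.
        if keep_markers:
            chunk = data[s:e + len(end_b)]
        else:
            chunk = data[s + len(start_b):e]
        clips.append(chunk.decode("utf-8"))
        pos = e + len(end_b)
    return clips


page = "<p>News: <b>rain</b> today</p><p>News: <i>sun</i> tomorrow</p>"
print(text_clip(page, "News:", "</p>", keep_markers=False))
# [' <b>rain</b> today', ' <i>sun</i> tomorrow']
```

Nothing in the sketch looks at tags or nesting, which is why the choice of start and end strings alone determines whether the result is still valid HTML.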

This mechanism provides the Web Clipping portlet author with a non-structural approach to document clipping. However, the original document content is always HTML. HTML is inherently a structured content type, and the structure is defined by special byte sequences (tags) that have a particular meaning. Extracting arbitrary slices of that sequence without regard for the special byte sequences is dangerous for the following reasons:

The resulting "clipped" content may alter the semantics of the HTML document and is very likely to cause unexpected or incorrect results when the document is rendered by a user-agent. Consider the following example.

Within some HTML document is the following content:

<A HREF="http://www.ibm.com">Go to IBM</A>
<A HREF="http://www.lotus.com">Go To Lotus</A>

If text clipping is used to clip this document using the start string "ibm", the end string "lotus", and retaining the start and end strings, the following sequence would be clipped:

ibm.com">Go to IBM</A>
<A HREF="http://www.lotus

In this example, the first "<A" has been lost. The effect depends on the user-agent that receives the fragment. In all likelihood, the "</A>" will be discarded and the text before it will be interpreted as text (not structural markup), in which case the ">" before the word "Go" is invalid character data because it is not escaped.

The following "<A HREF..." may or may not automatically be closed, but it will almost certainly cause problems in any user-agent.
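
One way to see the damage before previewing a clip is to count the anchor tags in the fragment. The sketch below is a rough check built on regular expressions, not a real HTML parser, and the function name tag_report is hypothetical.

```python
import re

# The fragment produced by the example above.
fragment = 'ibm.com">Go to IBM</A>\n<A HREF="http://www.lotus'


def tag_report(html: str) -> dict:
    """Count complete <A> open/close tags and flag a tag cut off at the end."""
    opens = len(re.findall(r"<a\b[^<>]*>", html, flags=re.IGNORECASE))
    closes = len(re.findall(r"</a\s*>", html, flags=re.IGNORECASE))
    # A "<" with no matching ">" before the end of the text means a tag was cut.
    unterminated = re.search(r"<[^<>]*$", html) is not None
    return {"open_a": opens, "close_a": closes, "unterminated_tag": unterminated}


print(tag_report(fragment))
# {'open_a': 0, 'close_a': 1, 'unterminated_tag': True}
```

A closing tag with no opening tag, plus a tag that never ends, matches the two problems described above.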


More on text clipping limitations and restrictions

The Text clipping option enables you to select the content between specific text strings that are in the HTML document. Content between these strings is kept, and all other content is discarded. However, as with all clipping types (including Keep All Content), before the content you intend to clip is pulled for editing, the HTML and BODY tags in the original HTML document are removed, and the HEAD tag and the entire contents of the HEAD section are removed to prepare the document for display within a portal.

The implication of this concerning text clipping is that the HTML, BODY, and HEAD tags, along with the entire contents of the HEAD section, will not be available for use within the starting or ending text strings used to perform text clipping.

For example, specifying a starting text string of <HTML> and an ending text string of </HTML> will yield no matching pieces of text. However, the desired end result can be easily achieved using either the HTML clipping option or the Keep all content option.
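
The effect of this preparation step can be illustrated with a small sketch. The function prepare_for_portal is hypothetical and uses simple regular expressions as a stand-in for the real processing, which is not described here.

```python
import re


def prepare_for_portal(html: str) -> str:
    """Rough stand-in for the preparation step (assumption, not the real code)."""
    # Drop the HEAD element together with everything inside it.
    html = re.sub(r"<head\b.*?</head\s*>", "", html, flags=re.IGNORECASE | re.DOTALL)
    # Drop the HTML and BODY tags themselves but keep what they contain.
    html = re.sub(r"</?(?:html|body)\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html.strip()


source = "<HTML><HEAD><TITLE>Demo</TITLE></HEAD><BODY><P>Hello</P></BODY></HTML>"
prepared = prepare_for_portal(source)
print(prepared)                                   # <P>Hello</P>
print("<HTML>" in prepared, "Demo" in prepared)   # False False
```

Because text clipping only ever sees the prepared content, a start or end string taken from the removed tags or from the HEAD section has nothing to match.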

If you would like to clip an entire page, use either the HTML clipping option or the Keep all content option. To specify clipping options, click Advanced options, then click Modify clipping type.


Portlets created during installation of the Web Clipping WAR file

When the Web Clipping WAR file is installed, two associated portlets appear in the list of available portlets:

Web Portlet HTML Template is used as a template for new portlets created by the Web Clipping Editor and cannot be added to a page. Adding the Web Portlet HTML Template to a page will result in an error.

Web Clipping Editor is the GUI for creating and editing Web clipping portlets and can be successfully added to a page.


Clipping sites containing JavaScript

JavaScript is used for two primary reasons:

  1. Make pages interactive through the use of an event response paradigm.

  2. Generate dynamic content.

    Procedures are executed client-side, within the user-agent, after the content has been retrieved from the server but before the document is rendered. This can be useful in generating content that is dependent on the user-agent or client environment.

Ideally, a given Web page would act within a Web clipping portlet just like it does in a stand-alone browser. In this respect, Web clipping is a sort of "Portlet Web Browser" and can be considered a unique user-agent that has unique restrictions and characteristics with respect to display and interaction mechanisms, especially with JavaScript.

All JavaScript on the source Web page is retained unless you specify otherwise using the Remove all JavaScript security option.

JavaScript within the HEAD of a document will be relocated to the BODY of the document prior to any other children of the BODY.

No other modifications to JavaScript will be made automatically.
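
The relocation of HEAD scripts can be pictured with the following sketch. It is an assumption-laden illustration: relocate_head_scripts is a hypothetical name, and regular expressions stand in for whatever parser the product really uses.

```python
import re

SCRIPT = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
HEAD = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.IGNORECASE | re.DOTALL)
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def relocate_head_scripts(html: str) -> str:
    """Move SCRIPT blocks from HEAD to the very start of BODY (sketch only)."""
    head = HEAD.search(html)
    if head is None or BODY_OPEN.search(html) is None:
        return html
    scripts = "".join(SCRIPT.findall(head.group(1)))
    # Rebuild the document with the scripts removed from the HEAD contents.
    without = html[:head.start(1)] + SCRIPT.sub("", head.group(1)) + html[head.end(1):]
    # Re-insert them directly after the opening BODY tag, before other children.
    return BODY_OPEN.sub(lambda m: m.group(0) + scripts, without, count=1)


page = ("<html><head><title>T</title><script>var x = 1;</script></head>"
        "<body><p>Hi</p></body></html>")
print(relocate_head_scripts(page))
# <html><head><title>T</title></head><body><script>var x = 1;</script><p>Hi</p></body></html>
```

The script still runs before the rest of the BODY is processed, but code that assumes it lives in the HEAD may behave differently.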

Runtime restrictions...

  1. JavaScript that uses relative URLs will be broken due to the fact that these are not rewritten during URL rewriting. That is, URLs within JavaScript (relative or not) will not be modified.

  2. JavaScript that depends on a specific hierarchy of a page structure using the DHTML models provided by various browsers may act unexpectedly depending on the situation.

  3. JavaScript that depends on specific browser functionality (for example, Netscape 6 functionality vs. Internet Explorer 6 functionality) may not work in other browsers: it may act unexpectedly or not run at all.
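
The first restriction is easiest to see with a toy URL rewriter. The sketch below assumes that rewriting only touches href and src attributes and simply resolves them against the source page address. The real rewriting scheme is not described here, and the function name and the example.com URLs are made up.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." attributes (double-quoted values only).
ATTR_URL = re.compile(r'(\b(?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_attribute_urls(html: str, base_url: str) -> str:
    """Resolve URLs in href/src attributes; text inside scripts is untouched."""
    return ATTR_URL.sub(
        lambda m: m.group(1) + urljoin(base_url, m.group(2)) + m.group(3),
        html,
    )


html = (
    '<a href="news/today.html">Today</a>'
    '<script>function go() { window.location = "news/archive.html"; }</script>'
)
print(rewrite_attribute_urls(html, "http://example.com/site/index.html"))
# <a href="http://example.com/site/news/today.html">Today</a><script>function go() { window.location = "news/archive.html"; }</script>
```

The link now points at the original site, while the relative URL inside the script would be resolved against the portal page and most likely fail.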

HTML Clipping restrictions...

  1. All <SCRIPT> blocks defined by the HTML <SCRIPT> element and JavaScript within the <HEAD> element of the HTML document being clipped will be removed.

  2. All JavaScripts, including all event handlers and embedded JavaScripts, are removed prior to displaying the HTML page being clipped. This means that for those scripts that generate content, the content will not be displayed in the HTML clipping editor and therefore cannot be clipped.

  3. For the same reason, <SCRIPT> blocks that are located within the <BODY> element of the HTML document cannot be individually retained. They may be retained implicitly if an element that contains the <SCRIPT> element within the document hierarchy is selected to be retained.

    For example, if the <BODY> directly contains a <SCRIPT> element child that generates some content, and the <BODY> element is not selected to be retained, the SCRIPT will be lost. However, if a <TD> element within the document contained a <SCRIPT> element that generated content for the <TD>, and the <TD> element is selected to be retained, the <SCRIPT> would be retained as well (barring the Remove JavaScript security constraint switch).

  4. JavaScript within HTML implicit event handlers (such as onLoad, onMouseOver, and onKeyDown) will only be retained if the element which defines the attribute is retained.

  5. JavaScript embedded within HREF attributes (using the JavaScript: prefix or &{...} syntax) will only be retained if the element which defines the attribute is retained.

In general, it is not a good idea to use the HTML Clipping clipping type with pages that contain JavaScript. Instead, use the Keep All Content clipping type to integrate these types of pages.


Double-byte character set limitation

If a Web page you are trying to clip does not contain a charset or contains a charset that is not supported by the Web Clipping Editor, then the Web Clipping Editor defaults to the ISO-8859 charset. In this case, double-byte character set characters may not be displayed correctly.
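
The following sketch shows why the fallback garbles double-byte text. It assumes the fallback behaves like ISO-8859-1 and uses Shift_JIS as the page's real encoding. Both choices are illustrative only.

```python
original = "日本語"

# Bytes as a server might send them for a Shift_JIS page with no declared charset.
raw = original.encode("shift_jis")

# Decoding with the single-byte fallback turns each byte into its own character.
wrong = raw.decode("iso-8859-1")

# Decoding with the page's real charset recovers the text.
right = raw.decode("shift_jis")

print(len(original), len(wrong))   # 3 6
print(right == original)           # True
```

Three characters become six unrelated ones, so the page needs to declare a supported charset for this kind of text to display correctly.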


HTML clipping limitations and restrictions

No content appearing in the Web clipping portlet

Due to the limitations of the HTML parser used by the Web Clipping Editor, certain pages with excessively malformed HTML cannot be fixed for proper display. In such cases, unexpected results may occur or no content may appear within the Web clipping portlet.


<FRAME> elements

The HTML FRAME support consists of the following:

Note the following restrictions concerning HTML FRAME support:


iFrames

The Web Clipping portlet was modified to be able to perform the equivalent function of an older WebPage/IFRAME portlet. The following set of restrictions apply when the Web Clipping portlet is configured to display content embedded in an iFrame and is configured to allow the browser/user-agent to access referenced resources directly.


Element selection issues

As you work with HTML clipping, you might have difficulties selecting the page elements you want to clip. This is most noticeable, for example, when you want to clip an entire table, but that table does not have a border or any other visual elements that you can use to select it. Instead, you are forced to select all the columns within the table individually and end up with the correct data but the wrong format (not grouped together in the original table).

The only workaround is trial and error: click or select different areas of the rendered output to clip, then preview the contents to see if you achieved the desired result. The preview function makes this easier because it lets you view the results of each selection attempt without having to add the Web clipping portlet to a page and examine its contents.

The HTML clipping tool allows you to select one element and then toggle elements contained within that element, making it appear as though you can keep all the content of a selected item, except for one or two of the elements that it contains. As useful as this can be, you cannot do it using the selection method described above. Instead, if you want to keep the entire contents of a single element except for a few of its sub-elements, you have to individually select just the sub-elements you want to keep.

For example, if you want to keep all the contents of a <TABLE> except for the contents of one <TD> element, you cannot select the <TABLE> element and then select the <TD> that you do not want. Instead, select all the <TD>s that you do want. Unfortunately, as mentioned before, because you are forced to select the <TD> elements individually, the data will not be grouped as it was in the original table.

