Web clipping limitations

Incorrect or unexpected results on the text clipping page

When creating a Web clipping portlet using the Text clipping option, you may encounter incorrect or unexpected results on the text clipping page that shows the candidate portions of the document to retain. This can happen because "text" clipping identifies the portions of the document to retain by operating on the content of the document at the byte level without interpreting or imposing any structure upon the document content. For this reason, extreme care must be taken when choosing the start and end strings. A detailed explanation of the limitations and dangers of text clipping follows.
Text clipping uses the following process:

The start string, end string, and content are converted to their UTF-8 byte representations.
The byte representation of the content is searched for literal occurrences of the start string byte sequence, followed by the end string byte sequence.
For each occurrence, all bytes between the start and end sequences are extracted and converted back into a UTF-8 string.
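
The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the described process, not the portlet's actual implementation: the function name text_clip and the keep_delimiters parameter are hypothetical, and the search is assumed to be case-sensitive because the process matches literal byte sequences.

```python
def text_clip(content, start, end, keep_delimiters=True):
    """Return every slice of content lying between start and end strings."""
    # Step 1: convert everything to UTF-8 bytes.
    data = content.encode("utf-8")
    start_b = start.encode("utf-8")
    end_b = end.encode("utf-8")

    clips = []
    pos = 0
    while True:
        # Step 2: find a literal start sequence followed by an end sequence.
        s = data.find(start_b, pos)
        if s == -1:
            break
        e = data.find(end_b, s + len(start_b))
        if e == -1:
            break
        # Step 3: extract the bytes and convert them back to a string.
        if keep_delimiters:
            chunk = data[s:e + len(end_b)]
        else:
            chunk = data[s + len(start_b):e]
        clips.append(chunk.decode("utf-8", errors="replace"))
        pos = e + len(end_b)
    return clips


sample = (
    '<A HREF="http://www.ibm.com">Go to IBM</A>\n'
    '<A HREF="http://www.lotus.com">Go To Lotus</A>'
)
for clip in text_clip(sample, "ibm", "lotus"):
    print(repr(clip))
```

Note that nothing in the sketch looks at tags or nesting. The slice boundaries depend only on where the byte sequences happen to occur.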

This mechanism provides the Web Clipping portlet author with a non-structural approach to document clipping. However, the original document content is always HTML. HTML is inherently a structured content type, and the structure is defined by special byte sequences (tags) that have a particular meaning. Extracting arbitrary slices of that sequence without regard for the special byte sequences is dangerous for the following reasons:

Any given byte sequence may begin or end within the middle of one of the special byte sequences that define document structure.
The HTML document structure is hierarchical, that is, certain document structures depend on parent structures to define their interpretation. Extracting arbitrary sequences is a lossy process: it can take a child structure out of the context of its parent structure and so cause a loss of meaning.

The resulting "clipped" content may alter the semantics of the HTML document and has a high probability for causing unexpected or incorrect results when the document is rendered by a user-agent. Consider the following example.
Within some HTML document is the following content:
<A HREF="http://www.ibm.com">Go to IBM</A>
<A HREF="http://www.lotus.com">Go To Lotus</A>
If text clipping is used to clip this document using the start string "ibm", the end string "lotus", and retaining the start and end strings, the following sequence would be clipped:
ibm.com">Go to IBM</A>
<A HREF="http://www.lotus
In this example, the first "<A" has been lost. The effect of this will differ depending upon the user-agent that receives it. In all likelihood, the "</A>" will be thrown away and the text prior to it will be interpreted as text (not structural markup), in which case the ">" prior to the word "Go" is invalid character data because it is not escaped.
The "<A HREF..." that follows is never terminated. It may or may not be closed automatically, but it will almost certainly cause problems in any user-agent.
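
The damage in this example can be checked mechanically. The following sketch counts complete opening and closing anchor tags in the clipped fragment and reports whether the fragment ends inside a tag. The helper inspect_fragment is hypothetical and uses simple regular expressions rather than a real HTML parser, so it is only suitable for illustrating this one case.

```python
import re

OPEN_A = re.compile(r"<a\b[^<>]*>", re.IGNORECASE)   # a complete <A ...> tag
CLOSE_A = re.compile(r"</a\s*>", re.IGNORECASE)      # a complete </A> tag


def inspect_fragment(fragment):
    """Summarize the anchor-tag structure of a clipped fragment."""
    return {
        "open_a": len(OPEN_A.findall(fragment)),
        "close_a": len(CLOSE_A.findall(fragment)),
        # True when the last "<" has no ">" after it.
        "unterminated_tag": fragment.rfind("<") > fragment.rfind(">"),
    }


clipped = 'ibm.com">Go to IBM</A>\n<A HREF="http://www.lotus'
print(inspect_fragment(clipped))
```

For the clipped sequence above, the sketch finds one closing tag with no matching opening tag, and a final tag that is never terminated.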

More on text clipping limitations and restrictions

The Text clipping option enables you to select the content between specific text strings that are in the HTML document. Content between these strings is kept, and all other content is discarded. However, as with all clipping types (including Keep All Content), before the content you intend to clip is pulled for editing, the HTML and BODY tags in the original HTML document are removed, and the HEAD tag and the entire contents of the HEAD section are removed to prepare the document for display within a portal.
The implication of this concerning text clipping is that the HTML, BODY, and HEAD tags, along with the entire contents of the HEAD section, will not be available for use within the starting or ending text strings used to perform text clipping.
For example, specifying a starting text string of <HTML> and an ending text string of </HTML> will yield no matching pieces of text, because neither tag is present by the time text clipping runs.
To clip an entire page, use either the HTML clipping option or the Keep all content option instead. To specify clipping options, click Advanced options, then click Modify clipping type.
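
The effect of this preparation step on text clipping can be illustrated with a rough sketch. The function prepare_for_portal is hypothetical and only approximates the described behavior with regular expressions. The real preparation step is not documented here and is likely more robust.

```python
import re

# Assumption: the HEAD section is dropped whole, and only the HTML and BODY
# tags themselves (not their contents) are removed.
HEAD_SECTION = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)
WRAPPER_TAGS = re.compile(r"</?(?:html|body)\b[^>]*>", re.IGNORECASE)


def prepare_for_portal(document):
    """Strip the HEAD section and the HTML and BODY tags from a document."""
    without_head = HEAD_SECTION.sub("", document)
    return WRAPPER_TAGS.sub("", without_head)


page = "<HTML><HEAD><TITLE>News</TITLE></HEAD><BODY><P>Hello</P></BODY></HTML>"
prepared = prepare_for_portal(page)
print(prepared)
print("<HTML>" in prepared, "</HTML>" in prepared)
```

Because the start and end strings are searched for in the prepared content only, strings such as <HTML>, </HTML>, or anything from the HEAD section can never match.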

Portlets created during installation of the Web Clipping WAR file

When the Web Clipping WAR file is installed, two associated portlets appear in the list of available portlets:

Web Portlet HTML Template
Web Clipping Editor

Web Portlet HTML Template is used as a template for new portlets created by the Web Clipping Editor and cannot be added to a page. Adding the Web Portlet HTML Template to a page will result in an error.
Web Clipping Editor is the GUI for creating and editing Web clipping portlets and can be successfully added to a page.

Clipping sites containing JavaScript

JavaScript is used for two primary reasons:

Make pages interactive through the use of an event response paradigm.
Generate dynamic content.
Procedures are executed client-side, within the user-agent, after the content has been retrieved from the server but before the document is rendered. This can be useful in generating content that depends on the user-agent or client environment.

Ideally, a given Web page would act within a Web clipping portlet just like it does in a stand-alone browser. In this respect, Web clipping is a sort of "Portlet Web Browser" and can be considered a unique user-agent that has unique restrictions and characteristics with respect to display and interaction mechanisms, especially with JavaScript.
All JavaScript on the source Web page will be retained, unless otherwise specified using the Remove all JavaScript security option.

JavaScript within the HEAD of a document will be relocated to the BODY of the document prior to any other children of the BODY.
No other modifications to JavaScript will be made automatically.

Runtime restrictions...

JavaScript that uses relative URLs will be broken because URLs within JavaScript (relative or not) are not modified during URL rewriting.
JavaScript that depends on a specific hierarchy of a page structure using the DHTML models provided by various browsers may act unexpectedly depending on the situation.
JavaScript that depends on specific browser functionality (for example, Netscape 6 functionality vs. Internet Explorer 6 functionality) may act unexpectedly in other browsers, or may not work at all.
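
The first restriction can be illustrated with a simplified rewriter. The sketch below assumes that URL rewriting touches only HREF and SRC attribute values and resolves them against the source page URL. The function rewrite_attribute_urls, the base URL, and the sample page are all hypothetical, and the portlet's real rewriting rules may differ.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." attributes with double-quoted values.
ATTR_URL = re.compile(r'(\b(?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_attribute_urls(html, base_url):
    """Make attribute URLs absolute; script text is left exactly as it was."""
    def absolutize(match):
        return match.group(1) + urljoin(base_url, match.group(2)) + match.group(3)
    return ATTR_URL.sub(absolutize, html)


page = '<a href="news.html">News</a><script>window.open("menu.html");</script>'
rewritten = rewrite_attribute_urls(page, "http://www.example.com/site/index.html")
print(rewritten)
```

The link in the HREF attribute is resolved against the original site, but the relative URL inside the script is unchanged. When the script runs inside the portal page, that URL resolves against the portal address instead of the original site.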

HTML Clipping restrictions...

All <SCRIPT> blocks defined by the HTML <SCRIPT> element and JavaScript within the <HEAD> element of the HTML document being clipped will be removed.
All JavaScripts, including all event handlers and embedded JavaScripts, are removed prior to displaying the HTML page being clipped. This means that for those scripts that generate content, the content will not be displayed in the HTML clipping editor and therefore cannot be clipped.
For the same reason, <SCRIPT> blocks that are located within the <BODY> element of the HTML document cannot be individually retained. They may be retained implicitly if an element that contains the <SCRIPT> element within the document hierarchy is selected to be retained.
For example, if the <BODY> directly contains a <SCRIPT> element child that generates some content, and the <BODY> element is not selected to be retained, the SCRIPT will be lost. However, if a <TD> element within the document contained a <SCRIPT> element that generated content for the <TD>, and the <TD> element is selected to be retained, the <SCRIPT> would be retained as well (barring the Remove JavaScript security constraint switch).
JavaScript within HTML implicit event handlers (such as onLoad, onMouseOver, and onKeyDown) will only be retained if the element which defines the attribute is retained.
JavaScript embedded within HREF attributes (using the JavaScript: prefix or &{...} syntax) will only be retained if the element which defines the attribute is retained.

In general, it is not a good idea to use the HTML Clipping clipping type with pages that contain JavaScript. Instead, use the Keep All Content clipping type to integrate these types of pages.

Double-byte character set limitation

If a Web page you are trying to clip does not contain a charset or contains a charset that is not supported by the Web Clipping Editor, then the Web Clipping Editor defaults to the ISO-8859 charset. In this case, double-byte character set characters may not be displayed correctly.
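
The following sketch shows why the fallback garbles double-byte text. It assumes the default is ISO-8859-1 (the text above says only "ISO-8859") and uses Python's codec registry as a stand-in for the set of charsets the editor supports. The function decode_page is hypothetical.

```python
import codecs

FALLBACK_CHARSET = "iso-8859-1"  # assumption: the exact default is not stated


def decode_page(raw, declared_charset):
    """Decode page bytes, falling back when the charset is missing or unknown."""
    charset = FALLBACK_CHARSET
    if declared_charset:
        try:
            charset = codecs.lookup(declared_charset).name
        except LookupError:
            pass  # unsupported charset: keep the fallback
    return raw.decode(charset, errors="replace")


original = "日本語"
raw = original.encode("shift_jis")        # a double-byte encoding

print(decode_page(raw, "shift_jis"))       # charset declared and supported
print(decode_page(raw, None))              # no charset: decoded with the fallback
```

With the fallback, each byte of a double-byte character is decoded as a separate single-byte character, so the original text cannot be displayed correctly.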

HTML clipping limitations and restrictions

No content appearing in the Web clipping portlet

Due to the limitations of the HTML parser used by the Web Clipping Editor, certain pages with excessively malformed HTML cannot be fixed for proper display. In such cases, unexpected results may occur or no content may appear within the Web clipping portlet.

<FRAME> elements

The HTML FRAME support consists of the following:

Enablement of all existing HTML-based Web clipping portlets to navigate to pages containing FRAME or FRAMESET elements. That is, if you have any existing Web clipping portlets and somewhere within the content of those portlets is a link to a page that contains FRAMESET elements, the link can now be traversed and the content displayed and navigated.
Creation of new HTML-based Web clipping portlets against pages that contain FRAMESET or FRAME elements using "Keep All Content" mode only.
Both inter-FRAME and intra-FRAME navigation are supported on pages with FRAMEs, just as in a desktop browser.

Note the following restrictions concerning HTML FRAME support:

FRAME tags that include the onload and onunload attributes for executing JavaScript functions are not preserved when converting the FRAME to a table cell. There is no support for those attributes on table cell (<TD>) elements.
The "Keep All Content" mode is required during creation of new Web clipping portlets directly referencing pages containing FRAMEs.
Web clipping portlets can be created from content containing FRAMEs only if the "Keep All Content" mode is used as opposed to the "HTML Clipping" mode. You can continue to create Web clipping portlets that indirectly reference pages containing FRAMEs or FRAMESETs through a link; however, you cannot clip those pages as you can referenced pages that do not contain FRAMEs. As a result of this restriction, FRAME navigation is not supported in the editor (on the Finish page or HTML clipping page).
Links in the created portlets cannot be followed if those pages contain embedded FRAMEs. For new portlets that contain FRAMEs and new or existing portlets that indirectly reference pages with FRAMEs, the links in that portlet can be navigated as usual. However, indeterminate results will occur if the linked pages also contain FRAMEs, that is, if they contain new FRAMESET or FRAME elements.
FRAME support is not provided for non-HTML user agents. Currently, the successful presentation and navigation of pages containing FRAMEs will work only for portlets that are targeted to be used with HTML-based user agents that support HTML conforming to the HTML 3.2 specification and above. Web clipping portlets that encounter pages with FRAMESET or FRAME elements cannot be viewed from mobile devices or non-HTML devices.

iFrames

The Web Clipping portlet was modified to perform the equivalent function of an older WebPage/IFRAME portlet. The following restrictions apply when the Web Clipping portlet is configured to display content embedded in an iFrame and is configured to allow the browser/user-agent to access referenced resources directly.

The portlet will not authenticate when viewed from Internet Explorer with the MS04-004 Cumulative Security Update applied.
The MS04-004 update disables the ability to use URLs of the format...
http://username:password@www.acme.com/login.jsp

This format was disabled for security purposes, as the user ID and password appear in cleartext within the URL and can be easily compromised.
The portal server and the server hosting the site for which authentication is required must be within the same top-level domain (for example, acme.com).
Due to security restrictions, browsers/user-agents will not accept cookies for server a.b.c if they are sent by server x.y.z or any other server outside the domain b.c. Doing so would allow potential spoof attacks to gain access to authenticated content on server a.b.c without its explicit consent. For this reason, the portal server on which the portlet resides and the server(s) hosting the site which requires authentication must be within the same top-level domain.
User-agents must access the portlet using the fully qualified domain name when the portlet uses FORM-based authentication.
When end users access the portlet from a browser, they must use the fully qualified domain name in the address bar.
For example...
http://www.ibm.com:10040/wps/portal

Access without the fully qualified domain name fails because browsers restrict cookies that come from a domain other than that of the server from which the response originates.
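
Two of the restrictions above can be illustrated with a short sketch. The first part uses the standard urllib.parse module to show how the credentials sit in plain view in the disabled URL format. The second part is a simplified domain check. The helper shares_cookie_domain and the host names are hypothetical, and real browsers apply additional cookie rules that are not modeled here.

```python
from urllib.parse import urlsplit

# Part 1: credentials embedded in the URL are trivially readable.
parts = urlsplit("http://username:password@www.acme.com/login.jsp")
print(parts.username, parts.password, parts.hostname)


# Part 2: a cookie scoped to a domain is only shared by hosts inside it.
def within_domain(host, domain):
    """True when host equals domain or is a subdomain of it."""
    host = host.lower().rstrip(".")
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def shares_cookie_domain(portal_host, backend_host, cookie_domain):
    """True when both servers fall inside the cookie's domain."""
    return (within_domain(portal_host, cookie_domain)
            and within_domain(backend_host, cookie_domain))


print(shares_cookie_domain("portal.acme.com", "www.acme.com", "acme.com"))
print(shares_cookie_domain("portal.ibm.com", "www.acme.com", "acme.com"))
# A short host name carries no domain, so it cannot match.
print(shares_cookie_domain("portal", "www.acme.com", "acme.com"))
```

The last call shows why the fully qualified domain name is required in the address bar: a short host name gives the browser no domain to compare against.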

Element selection issues

As you work with HTML clipping, you might have difficulties selecting the page elements you want to clip. This is most noticeable, for example, when you want to clip an entire table, but that table does not have a border or any other visual elements that you can use to select it. Instead, you are forced to select all the columns within the table individually and end up with the correct data but the wrong format (not grouped together in the original table).
The only workaround is trial and error: click or select different areas of the rendered output to clip, then preview the contents to see if you achieved the desired result. The preview function makes this easier because it lets you view the results of each selection attempt without having to add the Web clipping portlet to a page and examine its contents.
The HTML clipping tool allows you to select one element and then toggle elements contained within that element, making it appear as though you can keep all the content of a selected item, except for one or two of the elements that it contains. As useful as this can be, you cannot do it using the selection method described above. Instead, if you want to keep the entire contents of a single element except for a few of its sub-elements, you have to individually select just the sub-elements you want to keep.
For example, if you want to keep all the contents of a <TABLE> except for the contents of one <TD> element, you cannot select the <TABLE> element and then select the <TD> that you do not want. Instead, select all the <TD>s that you do want. Unfortunately, as mentioned before, because you are forced to select the <TD> elements individually, the data will not be grouped as it was in the original table.
