Hints and tips for using Portal Search

 


Content Model has only one search collection

The Content Model Search Service has only one search collection, which is provided by default with the portal installation.

You cannot modify this Content Model search collection or create additional search collections under the Content Model Search Service. The Content Model Search Service is listed because you can include it in search scopes.

 

Users cannot see portal site search results in their preferred language

If the preferred language of the crawler user ID does not match the language of the search collection, users might not see search results in their language. Therefore, set the preferred language of the portal site crawler user ID to match the language of the portal site search collection that it crawls.

If the portal site is multilingual and users search the portal in different languages, set up the portal site collections as described under...

 

Use the Search Center with external search services in different languages

To use external search services such as Google or Yahoo! with an English search keyword, a URL such as the sample URL given in the Search Center portlet help for configuring the portlet works as is:

http://www.google.com/search?q=

However, if you search in other languages, consult the documentation of the remote search service that you use to ensure that the Web interface is set up and used appropriately for the language that you use for the search. This can avoid problems with the displayed results, depending on the combination of languages set for the portal, the browser, and the search.
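
For languages other than English, the search keyword must be URL-encoded correctly, and the remote service might offer additional parameters for language or character encoding. The following Python sketch is for illustration only: the base URL comes from the sample above, but the hl parameter is an assumption about one particular provider, so check your provider's documentation for the parameters it actually supports.

    from urllib.parse import urlencode

    # Illustrative sketch only: build an external search URL for a non-English
    # keyword. The portlet help uses a prefix ending in "q="; this sketch builds
    # the full query string itself. The "hl" (interface language) parameter is
    # an assumption and is provider-specific.
    base_url = "http://www.google.com/search?"
    params = {
        "q": "Böblingen",   # non-English keyword; urlencode() percent-encodes it
        "hl": "de",         # assumed language parameter; consult the provider
    }
    print(base_url + urlencode(params))
    # Example output: http://www.google.com/search?q=B%C3%B6blingen&hl=de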

 

How Portal Search handles special characters when indexing

Portal Search indexes words that are composed of consecutive literals, that is, letters, digits, and certain special characters. The indexed special characters include the following:

  • Hash or pound sign ( # ).
  • Percent sign ( % ).
  • Plus sign ( + ).
  • Asterisk ( * ).

During indexing, special characters are handled as follows:

Blank or white space; this includes the tab

Blanks separate words and are not indexed. Example: The string key board is indexed as two separate words key and board.

Line break or new line

Line breaks separate words and are not indexed unless they are preceded by a dash ( - ). Examples:

  • The string

        key 
        board 
    is indexed as two separate words key and board.

  • The string

        key-
        board 

    is indexed as one word keyboard.

Dot or sentence end period (.) and comma (,)

Dots and commas separate words and are not indexed, unless they are both preceded and followed by a letter or digit. Example: The string www.ibm.com is indexed as www.ibm.com and not as three separate words.

Question mark ( ? ) and exclamation mark ( ! )

Question marks and exclamation marks separate words and are not indexed unless they are followed by a letter.

Other punctuation: ( ) { } [ ] < > ; : / \ | " _ -

These characters separate words and are not indexed.

Other characters

All other characters are removed from the strings in which they appear but do not separate words.

All characters that split words are discarded during indexing and searching.
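
The following Python sketch approximates the word-splitting rules described above. It is an illustration only, not the actual Portal Search indexer, and may differ from it in detail.

    import re

    def split_words(text):
        """Rough approximation of the word-splitting rules described above."""
        # A dash immediately before a line break joins the word with the next line.
        text = re.sub(r"-[ \t]*\n[ \t]*", "", text)
        # Dots and commas split words unless both neighbours are letters or digits.
        text = re.sub(r"(?<![0-9A-Za-z])[.,]|[.,](?![0-9A-Za-z])", " ", text)
        # Question and exclamation marks split words unless followed by a letter.
        text = re.sub(r"[?!](?![A-Za-z])", " ", text)
        # Other separating punctuation always splits words.
        text = re.sub(r'[(){}\[\]<>;:/\\|"_-]', " ", text)
        # Blanks, tabs, and remaining line breaks separate words.
        return text.split()

    print(split_words("key board"))    # ['key', 'board']
    print(split_words("key-\nboard"))  # ['keyboard']
    print(split_words("www.ibm.com"))  # ['www.ibm.com']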

The statements above apply to indexing. In a search query, however, all characters that can be part of the search syntax are interpreted as search syntax and not as part of the search text. These are...

  • plus ( + )
  • minus ( - )
  • double quotation marks ( " )
  • asterisk wild card character ( * )

If users want to include such characters in their search query, they must enclose the term in double quotation marks. For example, "+hello" searches for the string +hello, and "*Hello*" searches for the string *Hello*.
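
As a simple illustration, the following Python helper wraps a query term in double quotation marks when it contains one of the search syntax characters. It is a sketch only and not part of the product.

    SYNTAX_CHARS = set('+-"*')

    def quote_term(term):
        """Quote a term so that search syntax characters are treated as literal text."""
        # Terms that already contain a double quotation mark may need extra
        # handling; this simple sketch does not cover that case.
        if any(c in SYNTAX_CHARS for c in term):
            return '"' + term + '"'
        return term

    print(quote_term("+hello"))   # "+hello"
    print(quote_term("*Hello*"))  # "*Hello*"
    print(quote_term("Hello"))    # Hello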

 

Time required for crawls and imports and availability of documents

The following search administration tasks can require extended periods of time:

  • Crawling a content source.

    During the crawl, documents might not be immediately available for searching or browsing.

  • Indexing the documents fetched by a crawl.

    When a crawl is complete and all documents have been collected, building the index takes additional time.

  • Importing a search collection.

    When you import data to a collection, it can take some time until the content sources for the collection are shown in the Content Sources in Collection box and the documents of the imported collection are available for crawling.

These tasks are placed in a queue. It might therefore take several minutes before they are executed and the respective time counters start, for example, the crawl Run time and the crawl timeout set by the option...

Stop collecting after (minutes):

The time required for these tasks is further influenced by the following factors:

  • The number of documents in the content source that is being crawled.

  • The size of the documents in the content source that is being crawled.

  • The speed and availability of the processors, hard drive storage systems, and network connection.

  • The value that you selected from the...

    Stop collecting after (minutes):

    ...drop-down menu when you created or edited the content source.

Therefore, both the time limits that you can specify and the times that are shown for these processes work as fuzzy time limits. This applies, for example, to the following scenarios:

  • When you start a crawl by selecting a content source in the Content Sources in Collection box and clicking Start collecting.

  • When you import a search collection and when you start a crawl on the imported search collection.

  • When a portal installation is complete and you initialize the pre-configured portal site collection by selecting the portal site content source and clicking Start collecting.

  • The time shown under Last update completed in the collection status information is later than you might expect from just adding the crawler time limit specified by Stop collecting after (minutes): to the crawl start time. This delay is caused by the additional time required to build the index.

Furthermore, this influences other status indicators shown in the Manage Search portlet. For example, the number of documents shown for a content source can be unexpectedly low or even zero ( 0 ) until the crawl on that content source has completed.

 

Memory required for crawls

Crawling can require large amounts of memory, depending on the Portal Search environment. Therefore, before you start a crawl, make sure that the portal has enough free memory. A memory shortage can cause a corrupted search collection and eventually lead to a system freeze.

To resolve this problem, raise the limit on the number of open files by using the ulimit command as the root administrator.

 

Uninstalling the portal does not delete search collections

When you uninstall WebSphere Portal, the directories and files for the search collections are not deleted. Therefore, before you uninstall the portal, delete all search collections by selecting the collections individually and clicking the option...

Delete Collection

If you do not do this, these files and directories remain on the hard drive.

To delete the search collection data after uninstalling the portal, you must do so manually. The directory path of a search collection is determined by what you typed in the field Location of Collection when you created the search collection.

To look up the collection location, click...

Administration | Search Administration portlet | Search Collections box | collection_name | Collection Status | Collection location
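
If you prefer to script the cleanup, the following Python sketch removes a left-over collection directory. The path shown is a placeholder only; replace it with the value shown under Collection location.

    import shutil
    from pathlib import Path

    # Placeholder path; replace it with the value from Collection location.
    collection_location = Path("/opt/WebSphere/PortalServer/collections/MyCollection")

    if collection_location.is_dir():
        shutil.rmtree(collection_location)   # deletes the collection data permanently
        print("Removed", collection_location)
    else:
        print("No such directory:", collection_location)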

 

HTTP crawler does not support JavaScript

The HTTP crawler of the Portal Search Service does not support JavaScript. Therefore, some text of Web documents might not be accessible for search by users, depending on how the text is prepared for presentation in the browser. Specifically, text that is generated by JavaScript might or might not be available for search.

 

UNIX operating systems might require higher limit of open files for Portal Search to work properly

The limit on the number of open files in a UNIX operating system might be too low for Portal Search to work properly. This might result in a Portlet Unavailable error. To resolve this problem and allow a higher number of open files, raise the limit by issuing the following command as the root administrator:

ulimit -n 4096

 

Creating the virtual portal site search collection fails

On any operating system, but especially on UNIX systems, the collection cannot be created if the file path length for the location of search collections exceeds 118 characters.

By default, the search collection for portal site content is created under...

portal_install/PortalServer/collections

Contributors to the length of the file path include...

To fix the problem:

  1. Shorten the path of the search collection location (a quick length check is sketched after these steps).

  2. Recreate the portal site search collection.
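
The following Python snippet is an illustrative sketch that checks a proposed collection location against the 118-character limit mentioned above; the path shown is a placeholder.

    MAX_PATH_LENGTH = 118  # limit described above

    # Placeholder path; replace it with your intended collection location.
    proposed_location = "/opt/portal/collections/portalsite"

    if len(proposed_location) > MAX_PATH_LENGTH:
        print("Path is %d characters long; choose a shorter location."
              % len(proposed_location))
    else:
        print("Path length of %d characters is within the limit."
              % len(proposed_location))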

 

Increase JVM heap size when using categorizer

If a predefined categorizer is used with Portal Search, increase the JVM heap size to at least 1024 MB.

  1. Start server1 and log in to the WebSphere Application Server administrative console.

  2. Navigate to...

    Servers | Application Servers | WebSphere_Portal | Process Definition | Java Virtual Machine

  3. Determine the configured maximum heap size; for example, this might be 512 MB.

  4. Increase the maximum heap size to at least 1024 MB.

  5. Restart the portal.
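
Alternatively, the same change can be scripted with wsadmin (Jython). The following sketch is not a definitive procedure: the cell and node names are placeholders, and you should verify the object and attribute names against your WebSphere Application Server version before use. Run it with wsadmin.sh -lang jython -f set_heap.py.

    # Sketch only: locate the WebSphere_Portal server, pick its JVM definition,
    # and raise the maximum heap size to 1024 MB. "myCell" and "myNode" are
    # placeholders for your actual cell and node names.
    server = AdminConfig.getid('/Cell:myCell/Node:myNode/Server:WebSphere_Portal/')
    jvm = AdminConfig.list('JavaVirtualMachine', server).splitlines()[0]
    AdminConfig.modify(jvm, [['maximumHeapSize', '1024']])
    AdminConfig.save()
    # Restart the portal afterwards (step 5 above) for the change to take effect.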

 

Search collection is unavailable for Search and Browse portlet

Problem:

A Search and Browse portlet cannot access the search collection that you configured for it.

Cause:

If you migrated from a previous version of portal, the parameter for specifying the target search collection has been changed in the configuration for the Search and Browse portlet. The parameter IndexName has been replaced by CollectionLocation.

Solution:

If you migrate from a previous portal version and have the Search and Browse portlet deployed, transfer the value from the old parameter to the new one manually. For details, refer to Migrating the Search and Browse portlet between portal versions.

 

Search collections unavailable in cluster if failover occurs

Problem:

If a cluster member in a portal cluster fails, users who were using the affected cluster member when the failover occurred can no longer access search collections. This can occur with horizontal scaling when a node fails or with vertical scaling when a particular cluster member fails.

Solution:

Users who are logged in to the cluster member that failed must log out of the portal and then log back in before they can access search collections again.

 

Search can return documents based on metadata

Search can return documents based on the metadata of those documents, not just on words found in the fields or actual text of the documents. As a result, it might appear to Portal Search users that their searches return documents that do not match the search criteria.

Cause:

Metadata for documents is also indexed for search. Therefore, if the metadata of a document matches the search criteria, that document is also returned as a result for the search.

Solution:

This works as designed and is usually considered to be of benefit.

 

Documents from deleted content source can remain available under scope

If you delete a content source, the documents that were collected from this content source remain available for search by users under all scopes that included the content source before it was deleted.

Cause:

These documents remain available until their expiration time has passed.

Solution:

The expiration time is specified by the option Links expire after (days): under General Parameters when you create or edit the content source.

 

Parent topic:

Portal Search