Portal, Express Beta Version 6.1
Operating systems: i5/OS, Linux, Windows
Portal Search supports all national languages that are supported by the portal.
When you create a search collection, you can select the language for which the collection is optimized. The index uses this language to analyze documents during indexing if no other language is specified for a document. This feature improves the quality of search results, because users can search with spelling variants of a keyword, including plurals and inflections.
Portal Search can index content stored in different languages and make it available for search. It uses the Unicode setting of the source content to crawl and index content for search. It supplies a choice of two tokenizers, selectable by administrators: N-gram indexing and linguistic indexing.
N-grams are sequences of n consecutive characters in a document. N-grams are generated from a document by sliding a "window" across the text of the document, moving it by one character at a time. N-grams have several advantages over words for use in indexing. They are language independent, so mixed-language text can be indexed easily. They are also useful for Asian languages in which word tokenization is more difficult, for example Chinese, Japanese, Korean, and Thai.
Linguistic indexing is based on a morphological analyzer that reduces terms to their base form. It can be usefully applied in most situations when indexing sources with both English and non-English content.
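The sliding-window generation of N-grams described above can be illustrated with a minimal sketch. This is not the Portal Search tokenizer; the function name `char_ngrams` and its parameters are hypothetical, chosen only to show the idea on plain Python strings.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text, in order.

    Hypothetical helper for illustration; not part of Portal Search.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n across the text, one character at a time.
    # If the text is shorter than n, the range is empty and so is the result.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("portal", 3))   # ['por', 'ort', 'rta', 'tal']
# No word boundaries are needed, so text without spaces works the same way.
print(char_ngrams("検索エンジン", 2))  # ['検索', '索エ', 'エン', 'ンジ', 'ジン']
```

Because the window operates on characters rather than words, the same code handles text in any language, which is the property that makes N-grams attractive for mixed-language collections.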
The Portal Search summarizer produces summaries for all languages that are supported by the portal. For some languages the summarizer has access to a stemmer program, which uses stems as the base forms of words, as opposed to the lemma forms used by summarizers that have dictionaries. Summaries for these languages can be of better quality. Currently the stemmer program is available for the following languages: