Predefined static taxonomy and categorizer

 


Portal Search provides a predefined Categorization Facility that categorizes documents with high accuracy into any of over 2,300 subjects. These subjects are grouped in the following main business category areas:

  • Architecture, Construction, and Real Estate
  • Computers
  • Entertainment, Media, and News
  • Environment, Energy and Mining
  • Finance
  • Food and Beverage
  • General Business
  • Hospitality and Travel
  • Military, Aerospace, and Security
  • Other Industries
  • Sales, Marketing, and Advertising
  • Telecommunications and Consumer Electronics
  • Transportation

Portal users can use the Categorization Facility to build applications that automatically determine the subject of documents that fall within any of these areas. The Categorization Facility is ready to use as supplied, and it can also be customized for your business. It can evaluate and categorize documents in English, French, Italian, and German.

The portal Categorization Facility consists of two major components...

  • Categorizer
  • Taxonomy Manager

You can customize the portal model-based Categorizer by creating additional categories, using two methods...

  • Create product name categories
  • Create synonyms for the categories

The Portal Search Taxonomy Manager portlet manages the predefined static taxonomy.

This topic describes the methods for creating additional categories and synonyms. When it searches documents for category names and synonyms, the categorizer looks for an exact match, including capitalization.

If you use a predefined categorizer, increase the JVM heap size to at least 1024 MB.

 

Product Name Categories

For a product name category you can choose any word or phrase, but you would most commonly use the names of your company's products or services.

Create one category for each product or group of products. For example, you can create a new category named "WebSphere Portal" using the Portal Search Taxonomy Manager. By default this creates a model for that category consisting of the phrase "WebSphere Portal".

The categorizer then looks for occurrences of that phrase in all documents and counts the number of such occurrences. The categorizer multiplies the number of occurrences by the weight assigned to that phrase to compute a score.

If the calculated score is greater than or equal to the current value of MinUserCatScore, then the categorizer reports that the document belongs to that category. A given document can belong to more than one Product Name Category.
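
The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the categorizer's actual implementation: the function names are hypothetical, and plain substring counting is an assumption standing in for whatever matching the categorizer performs internally.

```python
from decimal import Decimal


def product_name_score(text, phrase, weight):
    """Count exact, case-sensitive occurrences of phrase and multiply by weight."""
    # str.count is case-sensitive, mirroring the exact-match rule described above.
    return text.count(phrase) * weight


def belongs_to_category(text, phrase, weight, min_user_cat_score):
    """A document belongs to the category when its score reaches the threshold."""
    return product_name_score(text, phrase, weight) >= min_user_cat_score


document = (
    "WebSphere Portal includes a categorizer. "
    "WebSphere Portal also includes a Taxonomy Manager. "
    "The lowercase form websphere portal is not counted."
)

score = product_name_score(document, "WebSphere Portal", Decimal("0.1"))
matched = belongs_to_category(
    document, "WebSphere Portal", Decimal("0.1"), Decimal("0.20")
)
```

Here the phrase occurs twice with the exact capitalization, so the score is 0.2 and the document meets a MinUserCatScore of 0.20.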

 

Synonyms

You can assign any number of synonyms to the standard set of categories shipped with WebSphere Portal or to the product name categories. You can also assign synonyms to interior nodes of the taxonomy. Each synonym is used to help the categorizer identify other instances of that category.

Common synonyms can be other spellings or capitalization patterns. They can also just be other phrases that signify a particular category. For example, if the documents you categorize often spell the name of a product in all capital letters, you can create a synonym such as WEBSPHERE PORTAL.

The best way to decide whether you need a synonym is to examine documents to see what forms of the category name are used in practice. When you create the synonym, you are prompted to assign a weight to it. The categorizer multiplies the number of occurrences of a synonym in a particular document by that weight to calculate a score, and adds it to the score for that category.

Example: A document is to be categorized. The categorizer reports the two top categories as...

  • Drinking Water Protection
  • Drinking Water Treatment

...with scores of 0.24 and 0.25, respectively. You assign "watershed protection" with a weight of 0.05 as a synonym to "Drinking Water Protection". If this new synonym is found once in this document, this alters the scores to 0.29 and 0.25, respectively. Consequently, the "best" answer from the categorizer is now "Drinking Water Protection."
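
The arithmetic of this example can be reproduced with a short Python sketch. The helper name is hypothetical, and the code only restates the addition described above; it does not call any portal API. Decimal is used so that the printed values match the figures in the text exactly.

```python
from decimal import Decimal


def add_synonym_score(base_score, occurrences, synonym_weight):
    """Add occurrences * weight of a synonym to the category's existing score."""
    return base_score + occurrences * synonym_weight


scores = {
    "Drinking Water Protection": Decimal("0.24"),
    "Drinking Water Treatment": Decimal("0.25"),
}

# "watershed protection" (weight 0.05) is found once in the document.
scores["Drinking Water Protection"] = add_synonym_score(
    scores["Drinking Water Protection"], 1, Decimal("0.05")
)

best = max(scores, key=scores.get)
print(best, scores[best])  # prints: Drinking Water Protection 0.29
```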

If you find that a category does not find all desired documents on a particular topic, add synonyms. You can assign any weight to each synonym. In general, however, you may find it best to use a weight of no more than 50% of MinUserCatScore for synonyms of product name categories, and no more than 50% of MinCatCos for synonyms of the standard WebSphere Portal categories. With such a weight, a single mention of a synonym cannot reach the threshold by itself; at least two mentions are needed.
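
As a rough illustration of the 50% guideline, the following Python sketch computes the suggested upper bound for a synonym weight and the number of synonym mentions needed to reach a threshold. Both function names are hypothetical, and the calculation considers synonym hits alone, ignoring any score the category already has.

```python
from decimal import Decimal
from math import ceil


def recommended_max_synonym_weight(threshold):
    """Upper bound suggested above: 50% of the relevant threshold."""
    return threshold / 2


def mentions_needed(threshold, weight):
    """Smallest number of mentions n such that n * weight >= threshold."""
    return ceil(threshold / weight)


min_user_cat_score = Decimal("0.20")  # default MinUserCatScore
min_cat_cos = Decimal("0.24")         # default MinCatCos

product_cap = recommended_max_synonym_weight(min_user_cat_score)  # 0.10
standard_cap = recommended_max_synonym_weight(min_cat_cos)        # 0.12

# At the cap, two mentions are needed to reach the threshold.
product_mentions = mentions_needed(min_user_cat_score, product_cap)
standard_mentions = mentions_needed(min_cat_cos, standard_cap)
```

With the default thresholds this gives caps of 0.10 and 0.12, and two mentions in both cases.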

 

Categorizer Parameters

The categorizer has a number of adjustable parameters. They can be set to achieve various results. The parameters are controlled by entries in the file...

ModelCategorizer.properties

The file is packaged in the following ZIP file...

portal_server_root/shared/app/eureka/resources/LL/CategorizerModel-yyyy-mmm-dd-LL-wps.zip

...where LL indicates the language code, such as fr, en, it, or de. An example is...

CategorizerModel-2003-Jul-10-en-wps.zip

The settings in the file supplied with the portal are configured with values for the best general usage. However, advanced administrators may decide to modify them. To modify the properties file, extract it from the ZIP file and edit it. Then leave the edited properties file in the same directory as the ZIP file. You do not need to replace the properties file inside the ZIP file with the new one.
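
The extraction step can be done with any ZIP tool. As one possible approach, the following Python sketch uses the standard zipfile module to copy the properties file out of the archive into the directory that holds the ZIP file. The function name is hypothetical, and the position of the file inside the archive is an assumption: the sketch simply takes the first entry whose file name is ModelCategorizer.properties.

```python
import zipfile
from pathlib import Path

PROPERTIES_NAME = "ModelCategorizer.properties"


def extract_properties(zip_path):
    """Copy the properties file from the model ZIP into the ZIP's own directory."""
    zip_file = Path(zip_path)
    target = zip_file.parent / PROPERTIES_NAME
    with zipfile.ZipFile(zip_file) as archive:
        # Assumption: the entry may sit in a subfolder, so match on file name only.
        member = next(
            (name for name in archive.namelist()
             if name.split("/")[-1] == PROPERTIES_NAME),
            None,
        )
        if member is None:
            raise FileNotFoundError(f"{PROPERTIES_NAME} not found in {zip_file}")
        # The archive itself is left unchanged.
        target.write_bytes(archive.read(member))
    return target
```

A call would pass the full path of the model ZIP for the language in question, for example the path ending in CategorizerModel-2003-Jul-10-en-wps.zip, and the returned path is the file to edit.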

The default settings for the parameters in the properties file are as follows:

Super category threshold

MinSuperCatCos = 0.05

Category threshold

MinCatCos = 0.24

Proximity to the highest cosine within which the second and third cosines must fall in order to remain part of the result set

SuperCatProximity = 0.04

Minimum score allowed for user categories in the ProperName Categorizer

MinUserCatScore = 0.20
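
To inspect or compare these values programmatically, a small parser is enough. The sketch below is a simplified reader for `key = value` lines and is not a full implementation of the Java properties format. The function name is hypothetical, and the `#` comment lines in the sample text are an assumption about how the descriptions appear in the file.

```python
DEFAULTS = """\
# Super category threshold
MinSuperCatCos = 0.05
# Category threshold
MinCatCos = 0.24
SuperCatProximity = 0.04
MinUserCatScore = 0.20
"""


def parse_properties(text):
    """Return a dict of numeric settings from simple 'key = value' lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = float(value)
    return settings


settings = parse_properties(DEFAULTS)
```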

The parameters and their settings are explained in the following:

MinSuperCatCos

Super category threshold.

The MinSuperCatCos value is a number between 0 and 1. Typical values are between 0.05 and 0.15.

The higher the value, the more stringent the categorizer is in determining the super category, or collection of categories, to which the document belongs. For shorter documents or for less professionally written documents, use a value closer to 0.05; for longer and more professional documents, use a higher value. Web pages often tend toward the shorter and less professional side; for those a setting of 0.05 is recommended. In any case, the value should be substantially lower than MinCatCos.

MinCatCos

Category threshold.

Number between 0 and 1. Typical values are between 0.15 and 0.27.

The higher the value is, the more stringent the categorizer is in determining the category to which the document belongs. Typical Web pages categorize best with a value of 0.24; however, short documents may categorize well with a lower value. Values slightly above 0.24 may be appropriate for single-topic documents that are professionally authored and of significant length, that is, several hundred words.

SuperCatProximity

Proximity to the highest cosine within which the second and third cosines must fall in order to remain part of the result set. Number between 0 and 1. Typical values are in the range of 0.01 to 0.08.

The higher the value is, the more likely the categorizer is to consider a broader set of super categories. Generally, this should be left at the default setting of 0.04.
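
The effect of this parameter can be pictured with the following Python sketch. It rests on an interpretation, not on documented internals: it assumes that the top three super categories are considered, that each must reach MinSuperCatCos, and that the second and third are kept only if their cosine lies within SuperCatProximity of the highest one. The function name and the sample cosines are made up for illustration.

```python
from decimal import Decimal


def select_super_categories(cosines, min_super_cat_cos, super_cat_proximity):
    """Keep up to three super categories that are close enough to the best one."""
    ranked = sorted(cosines.items(), key=lambda item: item[1], reverse=True)[:3]
    if not ranked:
        return []
    top_cosine = ranked[0][1]
    return [
        name
        for name, cosine in ranked
        # Assumed rule: meet the threshold and stay within the proximity window.
        if cosine >= min_super_cat_cos and top_cosine - cosine <= super_cat_proximity
    ]


cosines = {
    "Computers": Decimal("0.12"),
    "Finance": Decimal("0.09"),
    "General Business": Decimal("0.06"),
}

narrow = select_super_categories(cosines, Decimal("0.05"), Decimal("0.04"))
broad = select_super_categories(cosines, Decimal("0.05"), Decimal("0.08"))
```

Under these assumptions, the default proximity of 0.04 keeps Computers and Finance, while a proximity of 0.08 also keeps General Business, which matches the statement that a higher value leads to a broader set of super categories.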

MinUserCatScore

Minimum score allowed for user categories in the ProperName Categorizer.

Applies to the user-created model data, that is, the product name categories and their synonyms described in the sections above.

Can have any value of zero (0) or greater. The higher the value is, the more stringent the categorizer is in determining the product name category to which the document belongs.

A document is assigned to a product name category when the product name score for that category is at or above the value of MinUserCatScore.

Because the default weight for each newly created product name category entry is 0.1, the default threshold of 0.2 implies that the product name must occur at least twice in the document for the document to be scored as belonging to that Product Name Category.
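
Before saving an edited properties file, it can help to check the values against the ranges given above. The following Python sketch is one way to do that. The function name is hypothetical, the checks only restate the ranges from this topic, and "substantially lower" is simplified to "lower" because the text gives no exact margin.

```python
def check_settings(settings):
    """Return a list of problems found in the categorizer settings."""
    problems = []
    # These three parameters are documented as numbers between 0 and 1.
    for name in ("MinSuperCatCos", "MinCatCos", "SuperCatProximity"):
        if not 0 <= settings[name] <= 1:
            problems.append(f"{name} must be between 0 and 1")
    # MinUserCatScore has no upper limit but cannot be negative.
    if settings["MinUserCatScore"] < 0:
        problems.append("MinUserCatScore must be zero or greater")
    # Simplification: only checks 'lower', not 'substantially lower'.
    if settings["MinSuperCatCos"] >= settings["MinCatCos"]:
        problems.append("MinSuperCatCos should be lower than MinCatCos")
    return problems


defaults = {
    "MinSuperCatCos": 0.05,
    "MinCatCos": 0.24,
    "SuperCatProximity": 0.04,
    "MinUserCatScore": 0.20,
}

default_problems = check_settings(defaults)
```

For the default settings the list of problems is empty.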

 

See also:

Taxonomy Manager
Configure the Taxonomy Manager portlet

 
