Webserver search engine on HTTP Server

This topic provides information about the Webserver search engine and national language considerations for the HTTP Server for i5/OS.
Information for this topic supports the latest PTF levels for HTTP Server for i5/OS. IBM recommends that you install the latest PTFs to upgrade to the latest level of the HTTP Server for i5/OS. Some of the topics documented here are not available prior to this update. See IBM Service for more information.
The Webserver search engine allows you to perform full-text searches on HTML and text files. Through customized Net.Data® macros, you control which options are available to the user and how the search results are displayed. You can enhance search results by using the thesaurus support. For information on configuring the search engine with the HTTP Server (powered by Apache), see Setting up the Webserver search engine on HTTP Server (powered by Apache).

How it works

Before you can search, you must have an index. The index is a set of files that contain the contents of the documents to be searched, stored in a searchable form. The search engine searches the index rather than the actual documents.
A search index is created based upon a document list. A document list contains a list of fully qualified path names of all the documents that you want to index.
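As an illustration of what a document list holds, the following minimal sketch collects fully qualified path names for HTML and text files under a directory. The function name `build_document_list` and the extension filter are assumptions made for this example; the product builds document lists through its own browser forms and CL commands, not through this code.

```python
import os

def build_document_list(root, extensions=(".html", ".htm", ".txt")):
    """Return the sorted absolute paths of matching files under root.

    Hypothetical helper: it only mimics the idea of a document list
    (one fully qualified path name per document to index).
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # str.endswith accepts a tuple of suffixes.
            if name.lower().endswith(extensions):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)
```

Each returned entry corresponds to one line of a document list.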
Documents satisfying a search request are returned by default in their order of ranking. A document's ranking specifies the relevance with respect to the specified search conditions. The following factors determine a document's ranking:

Frequency of search terms in the document - As the search words appear more frequently in the document, the ranking gets higher.

Position of search terms in the document - As the search words appear closer to the beginning of the document, the ranking gets higher.

Frequency of search terms in the whole set of documents - As the search words appear less frequently within the documents in the entire index, the ranking for documents that have search words gets higher.

It is possible that a document with one search term appearing toward the beginning of the document can have a higher ranking than a document with multiple search terms appearing near the end of the document. The search function assumes that words indicating the subject or topic of the document usually appear near the beginning of the document. The highest ranking a document can have is 100%. A document can achieve a ranking of 100% if relatively few of the documents in the index contain the search terms. If many documents in the index contain the search terms, it is likely that none of the documents would achieve a ranking of 100%.
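The three ranking factors can be sketched with a toy score. This is not the formula the search engine uses, which this topic does not describe; the function `toy_rank` and the way it multiplies the factors are assumptions chosen only to show how frequency, position, and rarity can each raise a ranking.

```python
import math

def toy_rank(term, doc_words, all_docs):
    """Score one document for one search term (higher is more relevant).

    Illustrative only: combines the three factors named in the text
    by simple multiplication, which is an assumption.
    """
    term = term.lower()
    words = [w.lower() for w in doc_words]
    if term not in words:
        return 0.0
    # Factor 1: the term appears more often in this document.
    frequency = words.count(term) / len(words)
    # Factor 2: the first occurrence is closer to the beginning.
    position = 1.0 - words.index(term) / len(words)
    # Factor 3: fewer documents in the whole index contain the term.
    containing = sum(1 for d in all_docs if term in (w.lower() for w in d))
    rarity = math.log(1 + len(all_docs) / max(1, containing))
    return frequency * position * rarity
```

With this sketch, a document that mentions the term in its first words scores higher than an otherwise similar document that mentions it at the end.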
You can provide the following search functions through the customized Net.Data macros:

Exact search - 100% of the letters match. For example, street returns street, Street, and STREET.

Fuzzy search - 60% of the letters match. For example, street returns street, streets, treat, and Tree.

Wildcard search - an asterisk (*) is replaced by zero or more letters and a question mark (?) is replaced by one letter. For example, jump* returns jump, jumps, Jumping, and jumper.

Proximity search - two or more words in the same sentence.

English word stemming - for example, knife returns knife and knives.

Case sensitive search - for example, Street returns Street, not street.

Boolean search (simple) - for example, A and B and C.

Boolean search (advanced) - for example, (A and (B or C) not D).

Document ranking - documents are automatically sorted according to ranking.

Thesaurus support - finds synonyms or related terms of a search word.

Search within results - search within returned search results only.

Simple and Advanced search
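The wildcard rules in the list above can be sketched by translating a search pattern into a regular expression. This is a minimal illustration, not the engine's implementation; the names `wildcard_to_regex` and `wildcard_match` are hypothetical, and case-insensitive matching is an assumption based on the jump* example returning Jumping.

```python
import re

# Matches a single letter (any word character except digits and underscore).
LETTER = r"[^\W\d_]"

def wildcard_to_regex(pattern):
    """Translate * (zero or more letters) and ? (one letter) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(LETTER + "*")
        elif ch == "?":
            parts.append(LETTER)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_match(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

For example, `wildcard_match("jump*", "Jumping")` is true, while `wildcard_match("j?mp", "jmp")` is false because the question mark requires exactly one letter.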

You can enhance search results through the use of the thesaurus support. A thesaurus contains words that are synonyms or related terms of a search word. For example, searching for Ping-Pong without thesaurus support results only in documents containing the string Ping-Pong. Using thesaurus support that includes synonyms for Ping-Pong, such as table tennis, results in documents containing either the string Ping-Pong or table tennis.
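The Ping-Pong example can be sketched as query expansion. The dictionary below is a hypothetical in-memory stand-in for a thesaurus; it does not reflect the format of the product's thesaurus definition file or dictionary, and `expand_query` and `matches` are names invented for this illustration.

```python
def expand_query(term, thesaurus):
    """Return the search term followed by its synonyms or related terms."""
    key = term.lower()
    related = thesaurus.get(key, [])
    return [term] + [r for r in related if r.lower() != key]

def matches(document_text, terms):
    """Return True if the document contains any of the terms (ignoring case)."""
    text = document_text.lower()
    return any(t.lower() in text for t in terms)

# Hypothetical thesaurus content, keyed by lowercase term.
thesaurus = {"ping-pong": ["table tennis"]}
terms = expand_query("Ping-Pong", thesaurus)
```

Searching with `terms` finds a document that mentions only table tennis, which a search for Ping-Pong alone would miss.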
The URL mapping rules file, built from your selected HTTP Server, is used to set the URL for each document found on a search. It can specify the server port number (or instance) to use and can also map resulting file path names to external path names.
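The idea of mapping a file path name to an external URL can be sketched as a prefix substitution. The rule representation below (pairs of file path prefix and URL prefix), the host name, and the function `map_path_to_url` are all assumptions for illustration; this topic does not show the actual format of the URL mapping rules file.

```python
def map_path_to_url(path, rules, host="www.example.com", port=80):
    """Map a document's file path to a URL using prefix rules.

    rules is a list of (file_prefix, url_prefix) pairs. The longest
    matching file prefix wins. Returns None if no rule applies.
    """
    for file_prefix, url_prefix in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if path.startswith(file_prefix):
            suffix = path[len(file_prefix):]
            # Omit the port from the URL when it is the HTTP default.
            netloc = host if port == 80 else f"{host}:{port}"
            return f"http://{netloc}{url_prefix}{suffix}"
    return None
```

With the hypothetical rule `("/www/mysite/htdocs/", "/")` and port 8080, the path `/www/mysite/htdocs/docs/a.html` maps to `http://www.example.com:8080/docs/a.html`.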

Sample files

Several sample files are shipped with the product to help you customize your own Web search function:

File	Description
`/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.ndm`	Sample Net.Data macro that you can customize.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/thesaurus_sample_search.ndm`	Sample Net.Data macro with thesaurus support that you can customize.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.html`	Sample search HTML file.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/HTML/`	Directory of sample HTML files that you can use to build a test search index.
`/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_thesaurus.txt`	Sample thesaurus definition file.

National language considerations

Documents that you are indexing can be encoded in most ASCII codepages and EBCDIC CCSIDs. Because the search engine does not support all CCSIDs, your documents might be converted to one of the supported CCSIDs during the indexing process. To see the CCSID used to index your documents, view the status of the search index.
Wildcard characters are not allowed in search strings for double-byte languages; a wildcard search is implied for these languages. Both the index name and the index directory name must be specified in single-byte characters. The contents of documents are often converted to one of the index CCSIDs listed below.
Documents in languages from the same set of included character sets can all be contained in one index, as long as the documents for each language are indexed separately. For example, an index can contain English and French documents: create the index with just the English documents, then update the index with the French documents. If you attempt to index Italian and Russian documents in the same index, an error occurs because the two languages cannot be converted to a common index CCSID. In this case you need to create two separate indexes. The following table describes the supported CCSIDs for indexes.

Index CCSID	Code page name	Included character sets (CCSIDs)
`500`	Latin 1	International Albanian, Belgian English, Belgian French, Canadian French MNCS, Danish, Dutch, Dutch MNCS, English International, English US, Finnish, French (France), French MNCS, German (Germany), German MNCS, Icelandic, Italian, Latin 1/Open Systems, Norwegian, Portuguese (Brazil), Portuguese (Portugal), Swedish
`838`	Thai	Thai
`870`	Latin 2	Croatian, Czech, Hungarian, Polish, Romanian, Serbian (Latin), Slovak, Slovenian
`1025`	Cyrillic	Bulgarian, Macedonian, Russian, Serbian (Cyrillic)
`1026`	Latin 5	Turkish
`875`	Greek	Greek
`424`	Hebrew	Hebrew
`420`	Arabic	Arabic
`1112`	Baltic	Latvian, Lithuanian
`1122`	Estonian	Estonian
`935`	Simplified Chinese (GB)	Simplified Chinese (GB)
`1388`	Simplified Chinese (GBK)	Simplified Chinese (GBK)
`937`	Traditional Chinese	Traditional Chinese
`5026 (930)`	Japanese Katakana	Japanese Katakana
`5035 (939)`	Japanese Latin	Japanese Latin
`1364 (933)`	Korean	Korean
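The rule that documents can share an index only when their languages convert to a common index CCSID can be sketched as a lookup. The dictionary below is an excerpt of the table above, and the function `common_index_ccsid` is a hypothetical helper, not part of the product.

```python
# Excerpt of the index CCSID table: language -> index CCSID.
INDEX_CCSID_BY_LANGUAGE = {
    "English US": 500,
    "French (France)": 500,
    "German (Germany)": 500,
    "Italian": 500,
    "Czech": 870,
    "Polish": 870,
    "Bulgarian": 1025,
    "Russian": 1025,
    "Turkish": 1026,
    "Greek": 875,
}

def common_index_ccsid(languages):
    """Return the shared index CCSID, or None if the languages need separate indexes."""
    ccsids = {INDEX_CCSID_BY_LANGUAGE[lang] for lang in languages}
    return ccsids.pop() if len(ccsids) == 1 else None
```

English and French documents share index CCSID 500, so `common_index_ccsid(["English US", "French (France)"])` returns 500. Italian and Russian map to different index CCSIDs, so the function returns `None`, which corresponds to the case where two separate indexes are required.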

Browser and CL command interface for the Webserver search engine and Web crawler

This table shows the browser form and the CL command for each search engine and Web crawling task.

Task	Browser form	CL command
Create an index	Create search index	CFGHTTPSCH OPTION(*CRTIDX)
Update an index	Update search index	CFGHTTPSCH OPTION(*ADDDOC); CFGHTTPSCH OPTION(*RMVDOC)
Merge an index	Merge search index	CFGHTTPSCH OPTION(*MRGIDX)
Delete an index	Delete search index	CFGHTTPSCH OPTION(*DLTIDX)
View the status of an index	View status of search index	CFGHTTPSCH OPTION(*PRTIDXSTS) (see spooled file QPZHASRCH)
Create a document list or start the web crawler	Build a document list	CFGHTTPSCH OPTION(*CRTDOCL) for local documents; STRHTTPCRL OPTION(*CRTDOCL) for the web crawler
Add documents to a document list	Build a document list	CFGHTTPSCH OPTION(*UPDDOCL) for local documents; STRHTTPCRL OPTION(*UPDDOCL) for documents found with the web crawler
Stop a web crawling session	Work with document list status	ENDHTTPCRL
Pause a web crawling session	Work with document list status	ENDHTTPCRL
Resume a web crawling session	Work with document list status	RSMHTTPCRL
Register a document list created before V4R5	Register document list	CFGHTTPSCH OPTION(*REGDOCL)
Delete a document list	Delete document list	CFGHTTPSCH OPTION(*DLTDOCL)
Display information about a document list	Work with document list status	CFGHTTPSCH OPTION(*PRTDOCLSTS) (see spooled file QPZHASRCH)
Create a URL mapping rules file	Build URL mapping rules file	CFGHTTPSCH OPTION(*CRTMAPF)
Append to a URL mapping rules file	Build URL mapping rules file	CFGHTTPSCH OPTION(*UPDMAPF)
Build a thesaurus dictionary	Build thesaurus dictionary	CFGHTTPSCH OPTION(*CRTTHSDCT)
Test a thesaurus dictionary	Test thesaurus dictionary	None
Retrieve a thesaurus definition from a dictionary	Retrieve thesaurus definition	CFGHTTPSCH OPTION(*RTVTHSDFNF)
Delete a thesaurus dictionary	Delete thesaurus dictionary	CFGHTTPSCH OPTION(*DLTTHSDCT)
Create a list of URLs to crawl	Build URL object	CFGHTTPSCH OPTION(*CRTURLOBJ)
Update a list of URLs to crawl	Build URL object	CFGHTTPSCH OPTION(*UPDURLOBJ)
Delete a list of URLs to crawl	Delete URL object	CFGHTTPSCH OPTION(*DLTURLOBJ)
Create an object containing crawling attributes	Build options object	CFGHTTPSCH OPTION(*CRTOPTOBJ)
Update an object containing crawling attributes	Build options object	CFGHTTPSCH OPTION(*UPDOPTOBJ)
Build an object with user IDs and passwords for authentication	Build validation list	CFGHTTPSCH OPTION(*CRTVLDL)
Add user IDs and passwords for authentication	Build validation list	CFGHTTPSCH OPTION(*ADDVLDLDTA)
Remove user IDs and passwords for authentication	Build validation list	CFGHTTPSCH OPTION(*RMVVLDLDTA)
Search an index	Search index	None
