Webserver search engine on HTTP Server
This topic provides information about the Webserver search engine and national language considerations for the HTTP Server for i5/OS.
Information for this topic supports the latest PTF levels for HTTP Server for i5/OS. IBM recommends that you install the latest PTFs to upgrade to the latest level of the HTTP Server for i5/OS. Some of the topics documented here are not available prior to this update. See IBM Service for more information.
The Webserver search engine allows you to perform full text searches on HTML and text files. You can control what options are available to the user and how the search results are displayed through customized Net.Data® macros. You can enhance search results by using the thesaurus support. For information on configuring the search engine with the HTTP Server (powered by Apache), see Setting up the Webserver search engine on HTTP Server (powered by Apache).
How it works
Before you can search, you must have an index. The index is a set of files that contain the contents of the documents to be searched, stored in a searchable form. The search engine uses the search index instead of searching the actual documents.
A search index is created based upon a document list. A document list contains a list of fully qualified path names of all the documents that you want to index.
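As an illustration of what a document list holds, the following minimal Python sketch collects fully qualified path names of HTML and text files under a directory. The function names and the one-path-per-line file layout are assumptions made for this example; they are not the product's document list format, and in practice the list is built with the browser forms or CL commands described later in this topic.

```python
import os

def build_document_list(root_dir, extensions=(".html", ".htm", ".txt")):
    """Return sorted absolute paths of files under root_dir with the given extensions.

    Hypothetical helper: it only mirrors the idea of a document list,
    that is, a list of fully qualified path names to index.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # str.endswith accepts a tuple of suffixes
            if name.lower().endswith(extensions):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)

def write_document_list(paths, list_file):
    """Write one path per line (an assumed layout, not the product's format)."""
    with open(list_file, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
```

Files with other extensions, such as images, are skipped because the search engine performs full text searches on HTML and text files.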
Documents that satisfy a search request are returned by default in order of ranking. A document's ranking indicates its relevance to the specified search conditions. The following factors determine a document's ranking:
- Frequency of search terms in the document - As the search words appear more frequently in the document, the ranking gets higher.
- Position of search terms in the document - As the search words appear closer to the beginning of the document, the ranking gets higher.
- Frequency of search terms in the whole set of documents - As the search words appear less frequently within the documents in the entire index, the ranking for documents that have search words gets higher.
It is possible that a document with one search term appearing toward the beginning of the document can have a higher ranking than a document with multiple search terms appearing near the end of the document. The search function assumes that words indicating the subject or topic of the document usually appear near the beginning of the document. The highest ranking a document can have is 100%. A document can achieve a ranking of 100% if relatively few of the documents in the index contain the search terms. If many documents in the index contain the search terms, it is likely that none of the documents would achieve a ranking of 100%.
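The interaction of the three ranking factors can be sketched in Python. The product's actual ranking formula is not documented here, so the function below is a toy model under stated assumptions: the name `rank`, the way the three factors are computed, and the choice to multiply them are all invented for illustration.

```python
def rank(doc_words, term, docs_with_term, total_docs):
    """Toy relevance score between 0 and 1 (not the product's formula).

    doc_words      -- list of words in the document, in order
    term           -- single search word
    docs_with_term -- number of documents in the index containing the term
    total_docs     -- number of documents in the index
    """
    wanted = term.lower()
    positions = [i for i, word in enumerate(doc_words) if word.lower() == wanted]
    if not positions:
        return 0.0
    # Factor 1: more occurrences in the document give a higher score.
    frequency = len(positions) / len(doc_words)
    # Factor 2: an earlier first occurrence gives a higher score.
    earliness = 1.0 - positions[0] / len(doc_words)
    # Factor 3: a term that is rarer across the index gives a higher score.
    rarity = 1.0 - docs_with_term / (total_docs + 1)
    return frequency * earliness * rarity

early = "street maps of the old town centre and nearby parks".split()
late = "a guide to the old town with every street by street".split()
# One early occurrence can outrank two late occurrences in this model.
print(rank(early, "street", 2, 100) > rank(late, "street", 2, 100))
```

In this model a document with a single occurrence at the start scores above a document with two occurrences at the end, which mirrors the behavior described above.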
You can provide the following search functions through the customized Net.Data macros:
- Exact search - 100% of the letters match. For example, street returns street, Street, and STREET.
- Fuzzy search - 60% of the letters match. For example, street returns street, streets, treat, and Tree.
- Wildcard search - an asterisk (*) is replaced by zero or more letters and a question mark (?) is replaced by one letter. For example, jump* returns jump, jumps, Jumping, and jumper.
- Proximity search - two or more words in the same sentence.
- English word stemming - for example, knife returns knife and knives.
- Case sensitive search - for example, Street returns Street, not street.
- Boolean search (simple) - for example, A and B and C.
- Boolean search (advanced) - for example, (A and (B or C) not D).
- Document ranking - documents are automatically sorted according to ranking.
- Thesaurus support - finds synonyms or related terms of a search word.
- Search within results - search within returned search results only.
- Simple and Advanced search
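The wildcard rules in the list above can be illustrated with a short Python sketch that translates a search pattern into a regular expression. This is not how the search engine is implemented; the function names are hypothetical, and restricting "letters" to the ASCII range A to Z is an assumption made to keep the example small. Matching ignores case, in line with the jump* example, which returns Jumping.

```python
import re

def wildcard_to_regex(pattern):
    """Translate * (zero or more letters) and ? (exactly one letter) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[A-Za-z]*")   # assumption: ASCII letters only
        elif ch == "?":
            parts.append("[A-Za-z]")
        else:
            parts.append(re.escape(ch))  # treat every other character literally
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_match(pattern, word):
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

words = ["jump", "jumps", "Jumping", "jumper", "jum", "bump"]
print([w for w in words if wildcard_match("jump*", w)])
```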
You can enhance search results through the use of the thesaurus support. A thesaurus contains words that are synonyms or related terms of a search word. For example, searching for Ping-Pong without thesaurus support results only in documents containing the string Ping-Pong. Using thesaurus support that includes synonyms for Ping-Pong, such as table tennis, results in documents containing either the string Ping-Pong or table tennis.
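The effect of thesaurus support on a query can be sketched as a simple term expansion. The dictionary layout and the function names below are hypothetical and serve only to illustrate the Ping-Pong example; the product builds its thesaurus dictionary from a thesaurus definition file, whose format is not shown here.

```python
def expand_query(term, thesaurus):
    """Return the search term followed by its synonyms or related terms."""
    key = term.lower()
    related = thesaurus.get(key, [])
    return [term] + [r for r in related if r.lower() != key]

def search(documents, term, thesaurus=None):
    """Return names of documents containing the term or, if given, a related term."""
    terms = expand_query(term, thesaurus) if thesaurus else [term]
    return [
        name
        for name, text in documents.items()
        if any(t.lower() in text.lower() for t in terms)
    ]

# Hypothetical sample data
documents = {
    "a.html": "Ping-Pong rules for beginners",
    "b.html": "Local table tennis clubs",
    "c.html": "Chess openings",
}
thesaurus = {"ping-pong": ["table tennis"]}

print(search(documents, "Ping-Pong"))             # without thesaurus support
print(search(documents, "Ping-Pong", thesaurus))  # with thesaurus support
```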
The URL mapping rules file, built from your selected HTTP Server, is used to set the URL for each document found on a search. It can specify the server port number (or instance) to use and can also map resulting file path names to external path names.
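To make the idea of URL mapping concrete, the following Python sketch maps a file path name to an external URL by using prefix rules. The rule representation (pairs of path prefix and URL prefix), the function name, the host name www.example.com, and the directory names are assumptions for this example; they do not describe the format of the URL mapping rules file itself.

```python
def map_path_to_url(file_path, rules, host="www.example.com", port=80):
    """Map a file path to a URL by using (path_prefix, url_prefix) rules.

    Hypothetical illustration only. The longest matching path prefix wins.
    Returns None when no rule applies.
    """
    for path_prefix, url_prefix in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if file_path.startswith(path_prefix):
            suffix = file_path[len(path_prefix):]
            # Omit the port from the URL when it is the HTTP default.
            authority = host if port == 80 else f"{host}:{port}"
            return f"http://{authority}{url_prefix}{suffix}"
    return None

rules = [("/www/mysite/htdocs/", "/")]  # hypothetical rule
print(map_path_to_url("/www/mysite/htdocs/docs/a.html", rules, port=8080))
```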
Sample files
Several files are shipped with the product that you can use to customize your own Web search function:
File | Description
/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.ndm | Sample Net.Data macro that you can customize.
/QIBM/ProdData/HTTP/Public/HTTPSVR/thesaurus_sample_search.ndm | Sample Net.Data macro with thesaurus support that you can customize.
/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_search.html | Sample search HTML file.
/QIBM/ProdData/HTTP/Public/HTTPSVR/HTML/ | Directory of sample HTML files that you can use to build a test search index.
/QIBM/ProdData/HTTP/Public/HTTPSVR/sample_thesaurus.txt | Sample thesaurus definition file.
National language considerations
Documents that you are indexing can be encoded in most ASCII codepages and EBCDIC CCSIDs. Because the search engine does not support all CCSIDs, your documents might be converted to one of the supported CCSIDs during the indexing process. To see the CCSID used to index your documents, view the status of the search index.
Wildcard characters in search strings are not allowed for double-byte languages, because a wildcard search is implied for those languages. Both the index name and the index directory name must be specified in single-byte characters. The contents of documents are often converted to one of the index CCSIDs listed below.
Documents in languages from the included character sets can all be contained in the same index, as long as the documents are indexed separately. For example, an index can contain English and French documents. Create the index including just the English documents, then update the index with the French documents. If you attempt to index Italian and Russian documents in the same index, an error will occur since the two languages cannot be converted to a common index CCSID. In this case you would need to create two separate indexes. The following table describes the supported CCSIDs for indexes.
Index CCSID | Code page name | Included character sets (CCSIDs)
500 | Latin 1 | International Albanian, Belgian English, Belgian French, Canadian French MNCS, Danish, Dutch, Dutch MNCS, English International, English US, Finnish, French (France), French MNCS, German (Germany), German MNCS, Icelandic, Italian, Latin 1/Open Systems, Norwegian, Portuguese (Brazil), Portuguese (Portugal), Swedish
838 | Thai | Thai
870 | Latin 2 | Croatian, Czech, Hungarian, Polish, Romanian, Serbian (Latin), Slovak, Slovenian
1025 | Cyrillic | Bulgarian, Macedonian, Russian, Serbian (Cyrillic)
1026 | Latin 5 | Turkish
875 | Greek | Greek
424 | Hebrew | Hebrew
420 | Arabic | Arabic
1112 | Baltic | Latvian, Lithuanian
1122 | Estonian | Estonian
935 | Simplified Chinese (GB) | Simplified Chinese (GB)
1388 | Simplified Chinese (GBK) | Simplified Chinese (GBK)
937 | Traditional Chinese | Traditional Chinese
5026 (930) | Japanese Katakana | Japanese Katakana
5035 (939) | Japanese Latin | Japanese Latin
1364 (933) | Korean | Korean
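The rule that documents can share an index only when their languages map to the same index CCSID can be expressed as a small lookup. The Python sketch below copies a few rows from the table above; the dictionary and function names are hypothetical, and the check is only an illustration of the rule, not something the product exposes.

```python
# Subset of the table above: language (character set) -> index CCSID
INDEX_CCSID_BY_LANGUAGE = {
    "English US": 500,
    "French (France)": 500,
    "German (Germany)": 500,
    "Italian": 500,
    "Czech": 870,
    "Polish": 870,
    "Bulgarian": 1025,
    "Russian": 1025,
    "Turkish": 1026,
    "Greek": 875,
}

def common_index_ccsid(languages):
    """Return the single index CCSID shared by all languages.

    Raises ValueError when the languages do not map to exactly one CCSID,
    and KeyError for a language that is not in the subset above.
    """
    ccsids = {INDEX_CCSID_BY_LANGUAGE[lang] for lang in languages}
    if len(ccsids) != 1:
        raise ValueError(f"no common index CCSID for {sorted(languages)}")
    return ccsids.pop()

print(common_index_ccsid(["English US", "French (France)"]))  # both map to 500
```

Calling the function with Italian and Russian raises ValueError, which corresponds to the case where two separate indexes are needed.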
Browser and CL command interface for the Webserver search engine and Web crawler
The following table lists the browser form and the CL command for each search engine and Web crawler task.
Task | Browser form | CL command
Create an index | Create search index | CFGHTTPSCH OPTION(*CRTIDX)
Update an index | Update search index | CFGHTTPSCH OPTION(*ADDDOC); CFGHTTPSCH OPTION(*RMVDOC)
Merge an index | Merge search index | CFGHTTPSCH OPTION(*MRGIDX)
Delete an index | Delete search index | CFGHTTPSCH OPTION(*DLTIDX)
View the status of an index | View status of search index | CFGHTTPSCH OPTION(*PRTIDXSTS); see spoolfile QPZHASRCH
Create a document list | Start the web crawler; Build a document list | CFGHTTPSCH OPTION(*CRTDOCL) for local documents; STRHTTPCRL OPTION(*CRTDOCL) for the web crawler
Add documents to a document list | Build a document list | CFGHTTPSCH OPTION(*UPDDOCL) for local documents; STRHTTPCRL OPTION(*UPDDOCL) for documents found with the web crawler
Stop a web crawling session | Work with document list status | ENDHTTPCRL
Pause a web crawling session | Work with document list status | ENDHTTPCRL
Resume a web crawling session | Work with document list status | RSMHTTPCRL
Register a document list created before V4R5 | Register document list | CFGHTTPSCH OPTION(*REGDOCL)
Delete a document list | Delete document list | CFGHTTPSCH OPTION(*DLTDOCL)
Display information about a document list | Work with document list status | CFGHTTPSCH OPTION(*PRTDOCLSTS); see spoolfile QPZHASRCH
Create a URL mapping rules file | Build URL mapping rules file | CFGHTTPSCH OPTION(*CRTMAPF)
Append to a URL mapping rules file | Build URL mapping rules file | CFGHTTPSCH OPTION(*UPDMAPF)
Build a thesaurus dictionary | Build thesaurus dictionary | CFGHTTPSCH OPTION(*CRTTHSDCT)
Test a thesaurus dictionary | Test thesaurus dictionary | None
Retrieve a thesaurus definition from a dictionary | Retrieve thesaurus definition | CFGHTTPSCH OPTION(*RTVTHSDFNF)
Delete a thesaurus dictionary | Delete thesaurus dictionary | CFGHTTPSCH OPTION(*DLTTHSDCT)
Create a list of URLs to crawl | Build URL object | CFGHTTPSCH OPTION(*CRTURLOBJ)
Update a list of URLs to crawl | Build URL object | CFGHTTPSCH OPTION(*UPDURLOBJ)
Delete a list of URLs to crawl | Delete URL object | CFGHTTPSCH OPTION(*DLTURLOBJ)
Create an object containing crawling attributes | Build options object | CFGHTTPSCH OPTION(*CRTOPTOBJ)
Update an object containing crawling attributes | Build options object | CFGHTTPSCH OPTION(*UPDOPTOBJ)
Build an object with user IDs and passwords for authentication | Build validation list | CFGHTTPSCH OPTION(*CRTVLDL)
Add user IDs and passwords for authentication | Build validation list | CFGHTTPSCH OPTION(*ADDVLDLDTA)
Remove user IDs and passwords for authentication | Build validation list | CFGHTTPSCH OPTION(*RMVVLDLDTA)
Search an index | Search index | None