The XML SAX Parser is based on the Apache Xerces library. It is used for reading large XML documents that the DOM-based XML parser cannot handle because of memory constraints. It extracts the data enclosed within the 'Group tag' supplied in the configuration and creates an Entry with the attributes present in that data.
Multiple group tags can be specified by separating the tag names with commas. This causes the SAX parser to break on any of the tags specified. When multiple group tags are specified, the SAX parser uses a first-in-wins approach: the group tag that is encountered first is the tag that closes the group. For example, if A and B are group tags and the document has a structure where B is a child of A, then A is the tag that closes the Entry (A is found before B and thus takes precedence).
Once a group tag has been found, then any nested occurrence of group tags will have no effect on the current Entry.
If no group tags have been defined, the entire XML document will be returned as a single Entry.
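The grouping rules above can be sketched with Python's standard xml.sax module. This is only an illustration of the first-in-wins behaviour, not the parser's actual implementation: the names GroupSplitter and split_entries are hypothetical, and treating the document's root element as the group when no group tags are given is an assumption made to model "the entire document as a single Entry".

```python
import xml.sax


class GroupSplitter(xml.sax.ContentHandler):
    """Hypothetical helper: records which group tag closed each entry."""

    def __init__(self, group_tags):
        super().__init__()
        # Comma-separated tag names, mirroring the 'Group tag' setting.
        self.group_tags = {t.strip() for t in group_tags.split(",") if t.strip()}
        self.open_tag = None  # group tag that opened the current entry
        self.depth = 0        # nesting depth of that same tag name
        self.current = []     # element names seen inside the current entry
        self.entries = []     # list of (closing group tag, element names)

    def startElement(self, name, attrs):
        if self.open_tag is None:
            # Assumption: with no group tags, the root element acts as the group.
            if name in self.group_tags or not self.group_tags:
                self.open_tag, self.depth, self.current = name, 1, []
            return
        # Inside a group: other group tags are treated as ordinary tags.
        if name == self.open_tag:
            self.depth += 1
        self.current.append(name)

    def endElement(self, name):
        if self.open_tag is None or name != self.open_tag:
            return
        self.depth -= 1
        if self.depth == 0:
            # Only the tag that opened the group can close it.
            self.entries.append((self.open_tag, self.current))
            self.open_tag = None


def split_entries(xml_text, group_tags):
    """Parse xml_text and return one (group tag, element names) pair per entry."""
    handler = GroupSplitter(group_tags)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.entries
```

With group tags "A,B" and the document <R><A><B><x/></B></A><B><y/></B></R>, the sketch produces two entries: the first is closed by A (the nested B is just another tag inside it), and the second is closed by the stand-alone B.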
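The entry attribute name is composed of the names of the enclosing tags, from the document root down, with "@" as the separator. An XML attribute of a tag is appended to that tag's name with "#" as the separator. For example, consider the following XML file: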
<?xml version="1.0" encoding="UTF-8"?>
<DocRoot>
  <Entry>
    <Company>
      <Name incorporated="yes">IBM Corporation</Name>
      <Country>USA</Country>
    </Company>
  </Entry>
  <Entry>
    <Company>
      <Name incorporated="no">Smith Brothers</Name>
      <Country>USA</Country>
    </Company>
  </Entry>
</DocRoot>
Using "Entry" as the group tag, the above XML document would yield two entries as follows:
Attribute name: DocRoot@Entry@Company@Name
Attribute value: IBM Corporation
Attribute name: DocRoot@Entry@Company@Name#incorporated
Attribute value: yes
Attribute name: DocRoot@Entry@Company@Country
Attribute value: USA

Attribute name: DocRoot@Entry@Company@Name
Attribute value: Smith Brothers
Attribute name: DocRoot@Entry@Company@Name#incorporated
Attribute value: no
Attribute name: DocRoot@Entry@Company@Country
Attribute value: USA
The attribute name may be shortened by specifying a 'Remove Prefix' value in the configuration. For example, a 'Remove Prefix' value of "DocRoot@Entry@Company" in the above example will result in the Entry containing attributes like:
Attribute name: Name
Attribute value: IBM Corporation
Attribute name: Name#incorporated
Attribute value: yes
Attribute name: Country
Attribute value: USA
...
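The naming scheme can be illustrated with a short Python sketch built on the standard xml.sax module. It is not the parser's own code: EntryFlattener and flatten are hypothetical names, and the sketch assumes that 'Remove Prefix' strips the given prefix together with the "@" that follows it. For brevity it keeps only the last value when an attribute name repeats inside one Entry.

```python
import xml.sax


class EntryFlattener(xml.sax.ContentHandler):
    """Hypothetical helper: turns each group into a dict of named attributes."""

    def __init__(self, group_tag, remove_prefix=""):
        super().__init__()
        self.group_tag = group_tag
        self.remove_prefix = remove_prefix
        self.path = []      # tag names from the document root to the current tag
        self.text = []      # one text buffer per open tag
        self.depth = 0      # greater than 0 while inside a group
        self.entry = None
        self.entries = []

    def _name(self, suffix=""):
        # Join the enclosing tag names with "@", then drop the prefix if it matches.
        name = "@".join(self.path) + suffix
        prefix = self.remove_prefix
        if prefix and name.startswith(prefix + "@"):
            name = name[len(prefix) + 1:]
        return name

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text.append([])
        if name == self.group_tag:
            if self.depth == 0:
                self.entry = {}
            self.depth += 1
        if self.depth:
            # XML attributes are named <tag path>#<attribute name>.
            for attr_name in attrs.getNames():
                self.entry[self._name("#" + attr_name)] = attrs.getValue(attr_name)

    def characters(self, content):
        if self.text:
            self.text[-1].append(content)

    def endElement(self, name):
        value = "".join(self.text.pop()).strip()
        if self.depth and value:
            self.entry[self._name()] = value
        if name == self.group_tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.entries.append(self.entry)
                self.entry = None
        self.path.pop()


def flatten(xml_text, group_tag, remove_prefix=""):
    """Return one dict per group found in xml_text."""
    handler = EntryFlattener(group_tag, remove_prefix)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.entries


SAMPLE = """<DocRoot>
  <Entry><Company>
    <Name incorporated="yes">IBM Corporation</Name><Country>USA</Country>
  </Company></Entry>
  <Entry><Company>
    <Name incorporated="no">Smith Brothers</Name><Country>USA</Country>
  </Company></Entry>
</DocRoot>"""
```

Calling flatten(SAMPLE, "Entry") returns two dictionaries keyed by the full names, such as DocRoot@Entry@Company@Name. Calling flatten(SAMPLE, "Entry", "DocRoot@Entry@Company") returns the same values under the shortened keys Name, Name#incorporated and Country.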
When the Connector is initialized, the XML Parser tries to perform Document Type Definition (DTD) verification if a DTD tag is present. The parser will read multi-valued attributes, although only one value of a multi-valued attribute will be shown when browsing the data in the Schema tab.
If the XML file has nested Entry tags, all Entry tags enclosed within the outermost Entry tag are treated as normal XML tags. For example:
<entry>
  <entry>
    <company>IBM</company>
  </entry>
</entry>
Here the entry will contain the following attribute:
attribute name: entry@entry@company attribute value: IBM
The default and recommended Character Encoding to use when deploying the XML SAX Parser is UTF-8. This will preserve data integrity of your XML data in most cases. When you are forced to use a different encoding, the Parser will handle the various encodings in the following way:
When reading a file the parser will look for encoding in the following order: