XML Parser

IBM Tivoli Directory Integrator

This XML Parser was introduced in TDI v7.0. It uses the XLXP implementation of the StAX (JSR-173) specification. StAX is a cursor-based XML parser, capable of both reading and writing XML.

The traditional DOM-based Parser available in older versions of TDI has been renamed, and is now available as the Simple XML Parser. The new XML Parser is deemed a replacement for the older component, and you are encouraged to migrate your older Configs to use the new Parser.

Introduction

A Connector uses the XML Parser either to retrieve a TDI Entry object from source XML or to output a TDI Entry object as XML. The XML Parser uses the StAX cursor-based parser internally. Previous versions of the TDI XML Parser (now the Simple XML Parser) used the DOM mechanism for parsing XML. The main advantage of the StAX implementation is that the TDI Parser is now much faster, because it does not need to load the whole XML structure into memory as DOM does. Because of its memory efficiency, the StAX implementation is better suited to TDI solutions that have to deal with unusually large XML structures.

The only drawback of this memory-efficient way of parsing XML data is that no random element access is available, since all StAX does is run through an XML structure and pull out one element at a time. Depending on the configuration of the TDI XML Parser, each element pulled out is either skipped or put in an Entry, with Attributes representing each element pulled out of the XML.
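The cursor-style processing described above can be illustrated with a minimal sketch in Python. This is not the XLXP StAX implementation used by TDI; it uses the standard library's xml.etree.ElementTree.XMLPullParser as a stand-in, and the function name pull_element_names is made up for this illustration. The document is fed in chunks, and each element name is reported as soon as its start tag has been seen, before the rest of the document has arrived.

```python
import xml.etree.ElementTree as ET


def pull_element_names(chunks):
    """Yield element names in document order while the input is still arriving."""
    parser = ET.XMLPullParser(events=("start",))
    for chunk in chunks:
        parser.feed(chunk)
        # Only the events that are complete so far are returned here.
        for _event, elem in parser.read_events():
            yield elem.tag
    parser.close()


# The document is deliberately split in the middle of a tag.
CHUNKS = ["<DocRoot><Entry><Us", "er>Jill Vox</User></Entry></DocRoot>"]
NAMES = list(pull_element_names(CHUNKS))
```

Because elements are only ever seen in document order, there is no way to jump back to an earlier element, which is the lack of random access mentioned above.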

Configuration

The XML Parser has the following configurable parameters:

Simple XPath

Contains an XPath-like expression used to locate the elements that are interpreted as entries. This parameter is also used to display the structure of the XML document to be written.

Entry Tag

Holds the name of the element that will wrap each entry passed to the XML Parser.

Value Tag

Holds the name of the element that will wrap each attribute value passed to the XML Parser.

Prefix to Namespace Map

Mappings between <prefix>=<namespace> separated by the pipe char (|). If the prefix starts with $ it will be considered as a default namespace declaration. The default value is "<prefix>=<namespace>".

XSD Schema Location

The schema location, used for display purposes only.

Character Encoding

Character encoding to use when reading or writing. The default is UTF-8; also see Character Encoding in the XML Parser.

Static Attributes Declaration

Used to declare attributes and prefixes. They will be written with the static elements read from the Simple XPath parameter. This is a text area, and the default is:

<!-- this is an example for statically declared XML attributes/namespaces -->
<!-- DocRoot xmlns="defaultNS" attr1="val2">
   <Entry xmlns:p1="p1NS" p1:attr2="val2" />
</DocRoot-->

Ignore repeating XML declarations while reading

Check this to acknowledge only the first XML declaration (if any); any subsequent declarations will be ignored. The default value is unchecked.

Coalescing

If checked, then the Parser will coalesce adjacent character data sections. The default value is unchecked.

Omit XML declarations when writing

Check this to suppress writing an XML declaration to the output. Useful for appending to an existing XML file. The default value is unchecked.

Multi-rooted Document

If checked, output each Entry as a standalone element. This will create a multi-rooted document. The default value is unchecked.

Indent Output

If this field is checked, then the XML output is indented. The default value is checked.

Detailed log

Check this to generate more detailed log messages.

Using the Parser

Navigation through the XML structure

The XML Parser recognizes very simple XPath expressions. According to the expression, the parser finds and returns an Entry that contains either a single Attribute object representing the element itself, or multiple Attribute objects if the wrapping/unwrapping function of the parser is used. Current XPath implementations require random element access (over an Object Model) to pinpoint the element(s) referred to by the XPath expression. Since a StAX parser does not provide random element access, it can only work with simple XPath expressions like these:

/root/container1/container2/entry
/root/prefix:container/entry
/root/$prefix:container/
/root/*/entry
/root/prefix:*
/root/$prefix:*/entry

Navigation when reading

We can provide several simple paths if the structure of the XML is complex. Each XPath expression is separated from the previous one by the pipe character - "|". Each expression is used for finding elements in the XML document. By default the XML Parser is able to work with XML documents that are two levels deep, just like the Simple XML Parser. In addition, the XML Parser provides an easy way of working with arbitrarily deep and complex hierarchical structures. For more details, take a look at these two sections:

Simple XML

This is the default way of parsing XML. Just like the Simple XML Parser, this parser is able to parse an XML structure like this one:

<?xml version="1.0" encoding="UTF-8" ?>
<DocRoot>
    <Entry>
        <telephoneNo>
                <ValueTag>555-888-8888</ValueTag>
                <ValueTag>555-999-9999</ValueTag>
        </telephoneNo>
        <User>Jill Vox</User>
    </Entry>
</DocRoot>

When in simple mode the XML parser will make sure that some of the elements in the hierarchy are stripped off to return a simple, flat-like data structure (Entry). The behavior of the parser is controlled by three parameters:

Simple XPath (xpath.expr) field - used to specify the path to the container element which will be searched for the presence of the element specified by the entry.tag parameter. By default this field is configured to find the root element of the input XML.
Entry Tag (entry.tag) field - used to specify the name of the element that represents the entry that will be returned.

The presence of this parameter specifies whether the parser will do a simple or advanced parsing. If this parameter is empty the XML Parser will do advanced parsing.
Value Tag (value.tag) field - used to specify the name of the element that holds a value of a multi-valued attribute.

This parameter is not used if the entry.tag parameter is empty.

The xpath.expr parameter can be used in conjunction with the ns.map parameter to filter some of the elements. For more details see the Advanced XML section.

Using the default values of these parameters the XML Parser can easily parse the example XML above and an entry with the following data will be returned:

{
  "telephoneNo": [
    "555-888-8888",
    "555-999-9999"
  ],
  "User": "Jill Vox"
}

Here the "Entry" element has been removed and also the ValueTag elements have been taken as values of the "telephoneNo" attribute.
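The flattening performed in simple mode can be sketched as follows. This is an illustration in Python, not the parser's actual code: the function name read_entries is hypothetical, its two keyword arguments merely mirror the entry.tag and value.tag parameters, and the Simple XPath container lookup is left out for brevity.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<DocRoot>
    <Entry>
        <telephoneNo>
            <ValueTag>555-888-8888</ValueTag>
            <ValueTag>555-999-9999</ValueTag>
        </telephoneNo>
        <User>Jill Vox</User>
    </Entry>
</DocRoot>
"""


def read_entries(xml_bytes, entry_tag="Entry", value_tag="ValueTag"):
    """Yield one flat dict per entry element, one element at a time."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag != entry_tag:
            continue
        entry = {}
        for child in elem:
            # Children wrapped in the value tag become a multi-valued attribute.
            values = [v.text for v in child.findall(value_tag)]
            entry[child.tag] = values if values else child.text
        yield entry
        elem.clear()  # release the subtree that has already been handed out


ENTRIES = list(read_entries(SAMPLE))
```

With the sample document this yields a single entry holding the two telephone numbers as a list and the user name as a plain string; neither the Entry element nor the ValueTag elements appear in the result.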

If the structure of the input XML is not known before reading, we can remove the value of the entry.tag parameter. The whole XML is then read at once, which shows what the XML structure looks like. Based on the returned information we can then reconfigure the parser to match the XML structure.

Advanced XML

The XML Parser runs in this mode when the entry.tag parameter is empty.

For each element found only a single Attribute object will be created. On each cycle the XML Parser returns an Entry object which contains only one Attribute that corresponds to the element found in the XML document. The XML Parser returns null if no element that matches any of the XPath expressions is found in the XML document.

There are two parameters that configure the way the parser finds data in the XML.

Simple XPath (xpath.expr) parameter - used to specify the path to the element which contains the desired data. This parameter is required.
Prefix To Namespace Map (ns.map) field - used to declare prefixes and namespaces. As we will see this is not a required parameter but provides more flexibility for finding specific data.

In order to fully describe these two parameters consider this example:

<?xml version="1.0" encoding="UTF-8" ?>
<root xmlns="defaultNS" xmlns:pref1="prefix1NS">
  <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2">
    <pref2:entryElement>
      <someData />
    </pref2:entryElement>
    <pref2:entryElement xmlns:pref2="prefix3NS">
      <moreData />
    </pref2:entryElement>
  </pref1:container>
</root>

Let's say that the desired data is in any of the entry elements. The simplest way to get each entry element is to specify the following expression:

 xpath.expr:  /root/container/entryElement

Each iteration will get a single entryElement. For this example we would need two iterations to get both of the entryElement elements. Without specifying the element's prefix or namespace the parser will match any element using the local name we give in the Simple XPath expression.

However, you may notice that the two entryElement elements are different, since they belong to different namespaces. Let's say that the desired data is the entryElement that belongs to the "prefix3NS" namespace. Using the previous configuration would also return data that is not needed (that is, the first entryElement). This is where ns.map comes in, since we need to tell the parser which namespace the desired element belongs to. Here is how we get only the second element:

  xpath.expr:  /root/container/pref2:entryElement
  ns.map:  pref2=prefix3NS

Here the parser will match the element's local name (that is, entryElement) and the namespace. If we do not specify pref2 in the ns.map field, then the parser will use only the prefix and the local name found in the xpath.expr expression when it does the matching. If we redefine pref2 in the ns.map, the latest definition will be used and any previous one will be ignored.

  xpath.expr:  /root/container/p1:entryElement | /root/container/p2:entryElement
  ns.map:  p1=prefix2NS | p2=prefix3NS

In this case the prefixes are ignored and only the elements' local names and namespaces are considered.

The following expression:

  xpath.expr:  /$defPref:root/container/pref2:entryElement
  ns.map:  pref2=prefix3NS | $defPref= defaultNS

This has the same meaning as the second example configuration. However, the expression $defPref tells the parser that the root element belongs to the default namespace "defaultNS". This is useful when the default namespace has been predefined somewhere in the XML. In this example the parser will only match the local name and the namespace, but will expect that the XML element it checks belongs to the default namespace (that is, has no prefix). In other words this:

  xpath.expr:  /$defPref:root/$somePref:container/pref2:entryElement
  ns.map:  pref2=prefix3NS | $somePref=prefix1NS | $defPref= defaultNS

will not return any entryElement elements.

The XML Parser has the ability to navigate the XML tree using wildcards. The supported wildcard is the asterisk character - "*", which is used to replace the local name of an element of the XML. Let's say the following configuration is set:

  xpath.expr:  /root/container/*

This expression would retrieve each element under the element with local name "container", thus resulting in two iterations in total. The result would be the same if the xpath.expr is set to "/root/container/pref2:*" and the pref2 is not defined in the ns.map field.

The following configuration:

  xpath.expr:  /root/container/p1:*
  ns.map:  p1=prefix2NS

will retrieve all the elements that are under the container element and that belong to the prefix2NS namespace. In our case this is only the first child of the container element.

The following wildcard operations are not allowed: "*:localName", "local*", "pref:*Name", etc. The asterisk character replaces the local name of an element only.
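The matching rules above can be sketched in Python as follows. This is a simplified illustration, not the parser's implementation: the names parse_ns_map, split_tag, step_matches and find are hypothetical, and two rules are deliberately left out because ElementTree does not expose the prefixes used in the source document. A prefix that is not in the map is ignored rather than compared literally, and a $-prefix is treated like an ordinary mapped prefix.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<root xmlns="defaultNS" xmlns:pref1="prefix1NS">
  <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2">
    <pref2:entryElement><someData /></pref2:entryElement>
    <pref2:entryElement xmlns:pref2="prefix3NS"><moreData /></pref2:entryElement>
  </pref1:container>
</root>
"""


def parse_ns_map(text):
    """Turn 'p1=ns1 | p2=ns2' into {'p1': 'ns1', 'p2': 'ns2'}."""
    mapping = {}
    for pair in text.split("|"):
        if pair.strip():
            prefix, _, namespace = pair.partition("=")
            mapping[prefix.strip()] = namespace.strip()
    return mapping


def split_tag(tag):
    """ElementTree spells qualified names as '{namespace}local'."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def step_matches(step, namespace, local, ns_map):
    prefix, _, name = step.rpartition(":")
    if name != "*" and name != local:
        return False
    if prefix in ns_map:
        return ns_map[prefix] == namespace
    return True  # no mapped prefix: the local name alone decides


def find(xml_bytes, xpath_expr, ns_map_text=""):
    """Yield each element whose path from the root matches one of the simple paths."""
    ns_map = parse_ns_map(ns_map_text)
    paths = [p.strip().strip("/").split("/") for p in xpath_expr.split("|")]
    stack = []  # (namespace, local name) of every currently open element
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            stack.append(split_tag(elem.tag))
            continue
        for steps in paths:
            if len(steps) == len(stack) and all(
                step_matches(step, ns, local, ns_map)
                for step, (ns, local) in zip(steps, stack)
            ):
                yield elem
                break
        stack.pop()
```

For example, find(DOC, "/root/container/entryElement") yields both entry elements, while adding the map "pref2=prefix3NS" to the path "/root/container/pref2:entryElement" yields only the second one.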

Navigation when writing

The main purpose of the Simple XPath (xpath.expr) parameter is to specify the place where the entry data should be put. By default this parameter is set to the single wildcard - "*". If the default value is not changed, the parser will output an XML document with a root element named DocRoot. You then have the choice either to remove the value of this parameter and have a multi-rooted document, or to replace the asterisk with a concrete value.

For example if the following path is set:

  xpath.expr:  /root/container/entry | /otherRoot/otherContainer/moreElements

then the parser will create the structure:

<root>
  <container>
    <entry>  
      /* The Entries go here. */
    </entry>
  </container>
</root>

Here the elements root, container and entry are static, since they do not belong to any entry passed to the parser as input. Depending on the configuration of the parser, these static elements are written either on each cycle (to wrap each entry) or once, to wrap all the entries.

Only the first path is used and the rest is ignored.

Using asterisks in the Simple XPath (xpath.expr) parameter when the XML Parser is in output mode will make the parser consider only the path before the first asterisk. For example the expression:

  xpath.expr:  /root/container/*/entry

will be considered as if you specified this expression:

  xpath.expr:  /root/container

The only expression that is an exception to the rule is:

  xpath.expr:  *

this will be read as if this was specified:

  xpath.expr:  DocRoot

You could think of the xpath.expr as the parameter that configures the root element(s) only. The parser has the ability to declare a single element that will wrap each entry output as XML. This element is configured in the entry.tag field. If this field has no value, then no element wrapping each entry will be output. By default this parameter has a value, so an additional element will be written to the output stream. The parser also provides a convenient field to configure the name of the element that will contain each value of a simple multi-valued Attribute. This is configured in the value.tag field, which by default is set to ValueTag; if the value is removed and the parser is asked to output such an Attribute, then each value will be put in an element with the name "value".

Notes:

The value.tag parameter is only considered if the entry.tag parameter is not empty. If it is empty the values of a multi-valued attribute will not be wrapped.
Neither the entry.tag nor the value.tag support a wildcard and if such is provided then an exception will be thrown as result.
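The output rules described in this section can be sketched in Python as follows. This is an illustration only, not the parser's code: the names static_path and write_entries are hypothetical, the sketch assumes that entry_tag is not empty and that at least one static element remains, and it ignores prefixes, static declarations and multi-rooted output.

```python
import xml.etree.ElementTree as ET


def static_path(xpath_expr):
    """Return the static element names used for output."""
    first = xpath_expr.split("|")[0].strip()  # only the first path is used
    if first == "*":
        return ["DocRoot"]  # the one exception to the asterisk rule
    steps = []
    for step in first.strip("/").split("/"):
        if "*" in step:
            break  # everything from the first asterisk on is dropped
        if step:
            steps.append(step)
    return steps


def write_entries(entries, xpath_expr="*", entry_tag="Entry", value_tag="ValueTag"):
    """Serialize a list of dicts under the static elements."""
    steps = static_path(xpath_expr)
    if not steps:
        raise ValueError("this sketch needs at least one static element")
    root = ET.Element(steps[0])
    parent = root
    for name in steps[1:]:
        parent = ET.SubElement(parent, name)
    for entry in entries:
        holder = ET.SubElement(parent, entry_tag)
        for name, value in entry.items():
            attribute = ET.SubElement(holder, name)
            if isinstance(value, (list, tuple)):
                for item in value:
                    # An empty value tag falls back to the element name "value".
                    ET.SubElement(attribute, value_tag or "value").text = str(item)
            else:
                attribute.text = str(value)
    return ET.tostring(root, encoding="unicode")


XML_OUT = write_entries([{"telephoneNo": ["555-888-8888", "555-999-9999"], "User": "Jill Vox"}])
```

With the default arguments, the entry from the "Simple XML" section is written back under DocRoot and Entry, with each telephone number wrapped in a ValueTag element.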

To declare attributes or prefixes on those static elements, we can use the Static Attributes Declaration (static.decl) field. To output the XML used in the "Advanced XML" section, we need to use the following configuration:

  xpath.expr:  /root/pref1:container/
  static.decl:  <root xmlns="defaultNS" xmlns:pref1="prefixNS">
                  <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2" />
                </root>

From this example we can see that the static.decl uses XML to markup the attributes and the namespaces that need to be output on the static roots. Note that the XML structure must match the resultant xml structure. This field can also contain information about the Entry Tag element.

If in the above example you add the parameter:

  entry.tag:  Entry

you could then add some attributes/namespaces on that level as follows:

    static.decl:  <root xmlns="defaultNS" xmlns:pref1="prefixNS">
                    <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2">
                      <Entry xmlns="otherDefaultNS" pref1:attribute3="attrValue3" />
                    </pref1:container>
                  </root>

We could define both the entry.tag and the value.tag to have prefixes, just as each element of an xpath.expr path could. The difference is that the prefix for the value.tag element must be defined before it is used. This can be done using the static.decl field or using the hierarchical entry structure provided in this version. Currently it is not possible to include the value.tag element in the static.decl field in the way the entry.tag element is included.

Reading XML

Each time the parser is asked for an entry it reads data from the InputStream and returns it as an Entry object. Each element found in the XML is represented by the Attribute class, which implements the org.w3c.dom.Element interface. Each attribute found in the XML is represented by the Property class, which implements the org.w3c.dom.Attr interface. Each CDATA section found in the XML is represented by the AttributeValue class, which implements the org.w3c.dom.CDATASection interface.

The StAX implementation of the XLXP project supports neither DTD nor XSD validation, so no validation is possible.

The XML Parser is able to read multi-rooted XML documents as long as they do not have multiple XML declarations. If a document has multiple XML declarations, it can be read if and only if the Ignore repeating XML declarations while reading check box is checked. This will, however, affect the performance of the parser, since the document will be double-checked for repeating XML declarations.

Notes:

Enabling the Ignore repeating XML declarations while reading check box will ignore all XML declarations except the first one. This means that if a CDATA section contains an XML declaration, it will be ignored as well. To work around this, we can fix the CDATA section manually after the entry is retrieved.
The XML Parser tries to find the appropriate encoding according to the description in Character Encoding in the XML Parser.
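The CDATA caveat in the notes above can be made concrete with a small Python sketch. It is only an assumption about how such filtering might behave, not the parser's actual mechanism, and the name drop_repeated_declarations is hypothetical. A filter that removes every XML declaration after the first cannot tell a stray declaration from one that is legitimately quoted inside a CDATA section.

```python
import re

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8" ?>
DECLARATION = re.compile(r"<\?xml\s[^?]*\?>")


def drop_repeated_declarations(text):
    """Keep the first XML declaration and remove every later one."""
    first = DECLARATION.search(text)
    if first is None:
        return text
    head = text[: first.end()]
    return head + DECLARATION.sub("", text[first.end():])


MULTI_ROOTED = '<?xml version="1.0"?><a/><?xml version="1.0"?><b/>'
WITH_CDATA = '<?xml version="1.0"?><a><![CDATA[<?xml version="1.0"?>]]></a>'

CLEANED = drop_repeated_declarations(MULTI_ROOTED)
DAMAGED = drop_repeated_declarations(WITH_CDATA)
```

In the second case the declaration inside the CDATA section is lost, which is why the CDATA content has to be repaired by hand after the entry is retrieved.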

A String representation of the retrieved Entry is available and can be accessed using the getCurrentEntryAsXMLString() method.

Writing XML

Each time the parser is asked to write an Entry object in the output stream it will call the toString() method on each object that will be included in the XML (the same way the Simple XML Parser behaves). Each entry will be flushed to the output stream separately, and in case of system failure the last flushed entry will have been safely sent.

The parser has the ability to output each Entry in a separate root, by checking the Multi-rooted Document option. This will result in a multi-rooted document where each entry has its own static root.

If you are appending the XML to an existing XML document, it is useful to check the Omit XML declarations when writing parameter. Checking (enabling) this parameter instructs the parser to omit the XML declaration that is usually put at the beginning of an XML document.

Character Encoding in the XML Parser

The XML Parser has a parameter, Character Encoding, that we can use to set the name of the encoding. When set, the encoding will be used to decode the InputStream passed to the parser during initialization. When this parameter is anything other than blank (empty string), it will be used, regardless of its value. If, for example, the InputStream is UTF-16BE encoded, has a Byte Order Mark (BOM) at the beginning, and the Character Encoding parameter is set to "UTF-16BE", then the parser will recognize the BOM sequence and skip it automatically. If the Character Encoding parameter is set to a different encoding (not compatible with the InputStream's encoding), then an exception will be thrown, indicating that an inappropriate encoding is specified.

When you are not sure about the encoding of the InputStream or file, you can let the parser discover it (if possible). This is the order that the Parser follows to discover the encoding of the XML if it is not explicitly specified in the configuration (that is, the Character Encoding parameter is empty):

The Parser will check for a BOM. If it is found then the parser will decode the InputStream using the information provided by that BOM. The recognizable encodings (based on the BOM) are: UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.

The parser does not recognize unusual (reversed) four byte sequences similar to the UTF-32's sequences. In this case an explicit configuration will be required (using the Character Encoding parameter).
If the InputStream or file does not provide a BOM sequence and no explicit configuration is set, then the parser will try to guess the encoding well enough to read the value of the encoding attribute in the XML declaration. The encodings UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE and IBM-1047 (an EBCDIC variation) are tried when reading the declaration. If an encoding attribute is found, its value will be used to decode the rest of the InputStream or file.

The XML declaration must be set on the first line of the document and must start from the first character.
If the Character Encoding parameter is not set, no BOM is found and no XML declaration is found (or the XML declaration does not have the encoding attribute) then the parser will use the default encoding which is UTF-8.
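The discovery order listed above can be sketched in Python as follows. This is an illustration of the order only, not the parser's code: the name detect_encoding is hypothetical, and IBM-1047 is left out of the candidate list because the Python standard library has no codec for it.

```python
import codecs
import re

# UTF-32 BOMs are tested first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]
CANDIDATES = ["UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"]
# The declaration must start at the very first character of the document.
DECLARATION = re.compile(r"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']")


def detect_encoding(data, configured=""):
    """Return the encoding name chosen for the given bytes."""
    if configured:  # an explicit Character Encoding always wins
        return configured
    for bom, name in BOMS:  # step 1: look for a BOM
        if data.startswith(bom):
            return name
    head = data[:400]
    for candidate in CANDIDATES:  # step 2: try to read the XML declaration
        match = DECLARATION.match(head.decode(candidate, errors="ignore"))
        if match:
            return match.group(1)
    return "UTF-8"  # step 3: the default
```

For example, detect_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') returns "ISO-8859-1", and input that has neither a BOM nor a declaration falls back to "UTF-8".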

If the encoding is known at design time, we recommend setting it explicitly in the XML Parser's configuration. This speeds up the Parser's initialization, because no lookup for an encoding is needed.

When the parser is initialized for writing (Output Mode) then it expects an explicit assignment of the Character Encoding parameter. If no such assignment is done the Parser will use UTF-8 as a default encoding (UTF-8 with no BOM sequence). If any BOM compatible encoding is explicitly specified (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE) then the parser will set a BOM sequence at the beginning of the stream.

Example

The example bundled in TDI_install_dir/examples/xmlparser2 demonstrates how IBM TDI is able to work with various XML documents using the capabilities of the XML Parser. Refer to the readme.txt file for more information.

Using XSD Schemas

Predefined XSD schema URI

The configuration of the TDI XML Parser includes a parameter that points to XSD schema(s). When the parser is asked for its schema, it reads the schema from the XSD and displays it. This, however, requires that the parser is properly configured. If the navigation path is not set, no schema can be retrieved. In that case you can read an entry to discover a sample schema.

No XSD provided

If no XSD is provided but the parser is properly configured (that is, the navigation path is set) the Parser will try to extract the Schema Location information from the XML. All schemas found will be checked for the desired element's schema. If no schema is found within the XML document then we will need to read an entry (that is, to read part of the XML) in order to display the content of the returned entry - which is default behavior for all schema querying. However the returned entry's structure might not be the same as another entry that is going to be read on the next cycle. This means that the schema displayed cannot be guaranteed to be complete or valid.

Configuring the Schema

To display the schema of the desired element(s), configure the path to it (them) and the path to the corresponding schema(s) (the schema path is optional; see No XSD provided). If there are multiple elements and/or schema paths, all paths should be separated by a vertical bar - the "|" character. Regardless of whether schema paths are entered, the Parser will check for schema locations inside the XML document. The first schema extracted from the XML will be chosen as the leading schema; if no schema is extracted, the first schema configured by you will be the leading one.

Schemas can be declared with corresponding namespaces when they are configured in the Parser configuration. The configuration is as follows:

namespace1 schema1 | namespace2 schema2 | noNamespaceSchema | ...

The namespace is used to determine which schema file a type in the XSD Schema belongs to (if the namespace is specified in the schema itself).
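As a small illustration of the format shown above, the following Python sketch splits such a configuration value into (namespace, schema) pairs. The name parse_schema_locations is hypothetical, and the sketch assumes that neither namespaces nor schema paths contain whitespace or the "|" character.

```python
def parse_schema_locations(text):
    """Split 'ns1 schema1 | ns2 schema2 | schema3' into (namespace, schema) pairs."""
    pairs = []
    for item in text.split("|"):
        parts = item.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
        elif len(parts) == 1:
            pairs.append((None, parts[0]))  # a schema without a namespace
    return pairs


LOCATIONS = parse_schema_locations("urn:nonstandard:XSD_Schema /path/Order.xsd | /path/Other.xsd")
```

Here the first schema is associated with the namespace urn:nonstandard:XSD_Schema, while the second one is recorded with no namespace; the file name Other.xsd is made up for the example.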

Benefits of the leading schema

The leading schema is the first schema checked for the elements that you entered. If the information is not found in the leading schema, the other schemas entered by you or extracted from the XML are checked. It is advisable to choose as the leading schema the one that contains the root element of the XPaths that you have entered.

The library used for schema parsing is slow when it comes to creating the XSD Schema. For this reason all paths are kept in a Map and when a schema is needed for the first time then it is created and kept in the Map in case it is needed later.

The result

For each element entered in the element's path an Entry will be created. Each entry will contain two attributes - Name and Type. Name is the name of the element being queried. Type is the corresponding type of the element found in the schema. If the type found in the schema is a primitive type (that is, no definition could be found for it in the provided schemas), its name is the value of the Type attribute. If the type is not primitive (that is, a definition for it was found in the provided schemas), the value of the Type attribute is a new Entry whose attribute names are the names of all elements and attributes found, and whose values are the types of the corresponding elements or attributes. If one of those types is again not primitive, a new Entry is created for it and filled in in exactly the same manner. The result is a tree-like structure.

When the schema of all elements is found the entries that are created are put in a Vector object, and this object is returned as the result of the schema querying.

This schema will be displayed flat in the current Configuration Editor. This will be enough for test purposes of the Query Schema functionality.

Indicators

Order indicators - There are three possibilities for schema indicators - all, choice and sequence. When one of these indicators is found in the schema file then a property called "#indicator" is set in the result entry. The value of the property is the indicator which is found in the schema. The information inside the Entry must obey the indicator.
Occurrence indicators - These indicators are set as attributes to the corresponding element. See the Attributes section below.

Attributes

If a schema element contains any attributes they are kept in a Map in a property "#attributes" which corresponds to the Entry in which the schema of the element will be written.

Example XSD Schema

Schema name: Order.xsd

Schema path: /path/Order.xsd

Contents of Order.xsd:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified"
            xmlns="urn:nonstandard:XSD_Schema"
            targetNamespace="urn:nonstandard:XSD_Schema" xmlns:stako="Stako">
    <xsd:element name="order" type="Order" />
    <xsd:complexType name="Order">
        <xsd:all>
            <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
            <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
        </xsd:all>
    </xsd:complexType>
    <xsd:complexType name="User">
        <xsd:all>
            <xsd:element type="xsd:string" name="deliveryAddress" />
            <xsd:element name="fullname">
                <xsd:simpleType>
                    <xsd:restriction base="xsd:string">
                        <xsd:maxLength value="30" />
                    </xsd:restriction>
                </xsd:simpleType>
            </xsd:element>
        </xsd:all>
    </xsd:complexType>
    <xsd:complexType name="Products">
        <xsd:sequence>
            <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Product">
        <xsd:attribute name="id" type="xsd:long" use="required" />
        <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
    </xsd:complexType>
</xsd:schema>

Configuring the Parser to display the User and Products schema:

  Simple XPath: /Order/User | /Order/Products 
  XSD Schema Location: /path/Order.xsd

Result

[[Name:user, Type:[fullname:[xsd:string:[xsd:maxLength:30]], deliveryAddress:string]],  [Name:products, Type:[product:[quantity:xsd:positiveInteger, id:xsd:long]]]]

This is the toString() method representation of the Vector returned as result. Everything is in one row.
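The way the Name and Type structure is derived from the schema can be sketched in Python as follows. This is an illustration of the idea only, not the schema library used by the parser: the name query_schema is hypothetical, nested dicts stand in for Entry objects, indicators and occurrence attributes are ignored, facet names are shown without a prefix, and an element without a type attribute is assumed to carry an inline simpleType restriction.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

ORDER_XSD = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="urn:nonstandard:XSD_Schema"
            targetNamespace="urn:nonstandard:XSD_Schema"
            elementFormDefault="qualified">
  <xsd:element name="order" type="Order" />
  <xsd:complexType name="Order">
    <xsd:all>
      <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
      <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
    </xsd:all>
  </xsd:complexType>
  <xsd:complexType name="User">
    <xsd:all>
      <xsd:element type="xsd:string" name="deliveryAddress" />
      <xsd:element name="fullname">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:maxLength value="30" />
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:all>
  </xsd:complexType>
  <xsd:complexType name="Products">
    <xsd:sequence>
      <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Product">
    <xsd:attribute name="id" type="xsd:long" use="required" />
    <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
  </xsd:complexType>
</xsd:schema>
"""


def query_schema(xsd_text, element_name):
    """Return {'Name': ..., 'Type': ...} for the first element with that name."""
    schema = ET.fromstring(xsd_text)
    complex_types = {ct.get("name"): ct for ct in schema.findall(XS + "complexType")}

    def describe_type(complex_type):
        members = {}
        for group in complex_type:
            if group.tag == XS + "attribute":
                members[group.get("name")] = group.get("type")
            else:  # xsd:all, xsd:sequence or xsd:choice
                for child in group.findall(XS + "element"):
                    members[child.get("name")] = describe_element(child)
        return members

    def describe_element(element):
        type_name = element.get("type")
        if type_name in complex_types:  # not primitive: expand recursively
            return describe_type(complex_types[type_name])
        if type_name is not None:  # primitive: the type name is the value
            return type_name
        restriction = element.find(XS + "simpleType/" + XS + "restriction")
        facets = {f.tag.replace(XS, ""): f.get("value") for f in restriction}
        return {restriction.get("base"): facets}

    for element in schema.iter(XS + "element"):
        if element.get("name") == element_name:
            return {"Name": element_name, "Type": describe_element(element)}
    return None
```

Calling query_schema(ORDER_XSD, "products") returns a nested structure in which product maps to the id and quantity attributes with their types, comparable to the second item of the result shown above.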