This XML Parser was introduced in TDI v7.0. It uses the XLXP implementation of the StAX (JSR-173) specification. StAX is a cursor-based XML parser, capable of both reading and writing XML.
The traditional DOM-based Parser available in older versions of TDI has been renamed, and is now available as the Simple XML Parser. The new XML Parser is deemed a replacement for the older component, and you are encouraged to migrate your older Configs to use the new Parser.
A Connector uses the XML Parser to either retrieve a TDI Entry object from source XML or output a TDI Entry object as XML. The XML Parser uses the StAX cursor-based parser internally. In previous versions of the TDI XML Parser (now the Simple XML Parser) the DOM mechanism was used for parsing XML. The main advantage of the StAX implementation is that the TDI Parser is now much faster, because it does not need to load the whole XML structure into memory as DOM does. Because of its memory efficiency, the StAX implementation is better suited to TDI solutions that must deal with unusually large XML structures.
The only drawback of this memory-efficient parsing mechanism is that no random element access is available, since all StAX does is run through an XML structure and pull out one element at a time. Depending on the configuration of the TDI XML Parser, each element pulled out is either skipped or put in an Entry with Attributes representing the element.
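The trade-off can be illustrated outside TDI. The following sketch is not TDI code: it uses Python's standard xml.etree.ElementTree.iterparse as a stand-in for a pull-style parser, and the helper name stream_elements is hypothetical. It visits one element at a time and releases each element's content once it has been handled, which keeps memory use low, but nothing that has already been passed can be revisited.

```python
import io
import xml.etree.ElementTree as ET

def stream_elements(xml_bytes, wanted_tag):
    """Yield the text of each <wanted_tag> element as soon as it is complete."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == wanted_tag:
            yield elem.text
        # Release the element's content once handled; it cannot be revisited.
        elem.clear()

users = list(stream_elements(
    b"<DocRoot><User>Jill Vox</User><User>Ann Lee</User></DocRoot>", "User"))
```

A DOM-style parser would instead build the whole tree first and allow any element to be looked up afterwards, at the cost of holding the entire document in memory.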
The XML Parser has the following configurable parameters:
<!-- this is an example for statically declared XML attributes/namespaces --> <!-- DocRoot xmlns="defaultNS" attr1="val2"> <Entry xmlns:p1="p1NS" p1:attr2="val2" /> </DocRoot-->
The XML Parser recognizes very simple XPath expressions. According to the expression, the parser finds and returns an Entry that contains either a single Attribute object representing the element itself, or multiple Attribute objects if the wrapping/unwrapping function of the parser is used. Current XPath implementations require random element access (over an Object Model) to pinpoint the element(s) referred to by the XPath expression. Since a StAX parser does not provide random element access, it can only work with simple XPath expressions like these:
You can provide several simple paths if the structure of the XML is complex. Each XPath expression is separated from the previous one by the pipe character - "|". Each expression is used for finding elements in the XML document. By default the XML Parser is able to work with XML documents that are two levels deep, just as the Simple XML Parser can. In addition, the XML Parser provides an easy way to work with arbitrarily deep and complex hierarchical structures. For more details, take a look at these two sections:
This is the default way of parsing XML. Just like the Simple XML Parser, this parser is able to parse an XML structure like this one:
<?xml version="1.0" encoding="UTF-8" ?>
<DocRoot>
  <Entry>
    <telephoneNo>
      <ValueTag>555-888-8888</ValueTag>
      <ValueTag>555-999-9999</ValueTag>
    </telephoneNo>
    <User>Jill Vox</User>
  </Entry>
</DocRoot>
When in simple mode the XML parser will make sure that some of the elements in the hierarchy are stripped off to return a simple, flat-like data structure (Entry). The behavior of the parser is controlled by three parameters:
entry.tag - The presence of a value for this parameter determines whether the parser does simple or advanced parsing. If this parameter is empty, the XML Parser does advanced parsing.
value.tag - This parameter is not used if the entry.tag parameter is empty.
xpath.expr - This parameter can be used in conjunction with the ns.map parameter to filter some of the elements. For more details see the Advanced XML section.
Using the default values of these parameters the XML Parser can easily parse the example XML above and an entry with the following data will be returned:
{ "telephoneNo": [ "555-888-8888", "555-999-9999" ], "User": "Jill Vox" }
Here the "Entry" element has been removed and also the ValueTag elements have been taken as values of the "telephoneNo" attribute.
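The flattening described above can be sketched as follows. This is an illustration in Python rather than TDI code: the function name parse_simple is hypothetical, the defaults "Entry" and "ValueTag" are assumed from the example, and for brevity the sketch loads the whole document at once, which the real parser does not do.

```python
import xml.etree.ElementTree as ET

def parse_simple(xml_text, entry_tag="Entry", value_tag="ValueTag"):
    """Return one flat dict per entry element, mimicking simple mode."""
    entries = []
    root = ET.fromstring(xml_text)
    for entry_elem in root.iter(entry_tag):
        entry = {}
        for child in entry_elem:
            # Value-tag children become the values of a multi-valued attribute;
            # otherwise the element's own text is the single value.
            values = [v.text for v in child.findall(value_tag)]
            entry[child.tag] = values if values else child.text
        entries.append(entry)
    return entries

SIMPLE_XML = """<DocRoot>
  <Entry>
    <telephoneNo>
      <ValueTag>555-888-8888</ValueTag>
      <ValueTag>555-999-9999</ValueTag>
    </telephoneNo>
    <User>Jill Vox</User>
  </Entry>
</DocRoot>"""

entries = parse_simple(SIMPLE_XML)
```

The wrapping Entry element and the ValueTag elements do not appear in the result; only their content does.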
If the structure of the input XML is not known before reading, you can remove the value of the entry.tag parameter. The whole XML is then read at once, which shows you what the XML structure looks like. Based on the returned information you can then reconfigure the parser to match the XML structure.
The XML Parser runs in this mode when the entry.tag parameter is empty.
For each element found only a single Attribute object will be created. On each cycle the XML Parser returns an Entry object which contains only one Attribute that corresponds to the element found in the XML document. The XML Parser returns null if no element that matches any of the XPath expressions is found in the XML document.
There are two parameters that configure the way the parser finds data in the XML.
In order to fully describe these two parameters consider this example:
<?xml version="1.0" encoding="UTF-8" ?>
<root xmlns="defaultNS" xmlns:pref1="prefix1NS">
  <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2">
    <pref2:entryElement>
      <someData />
    </pref2:entryElement>
    <pref2:entryElement xmlns:pref2="prefix3NS">
      <moreData />
    </pref2:entryElement>
  </pref1:container>
</root>
Let's say that the data we need is in any of the entry elements. The simplest way to get each entry element is to specify the following expression:
xpath.expr: /root/container/entryElement
Each iteration will get a single entryElement. For this example we would need two iterations to get both of the entryElement elements. Without specifying the element's prefix or namespace the parser will match any element using the local name we give in the Simple XPath expression.
However, you may notice that the two entryElement elements are different, since they belong to different namespaces. Let's say that the desired data is the entryElement that belongs to the "prefix3NS" namespace. Using the previous configuration will also return data that is not needed (that is, the first entryElement). This is where ns.map comes in, since we need to tell the parser which namespace the desired element belongs to. Here is how we get only the second element:
xpath.expr: /root/container/pref2:entryElement
ns.map: pref2=prefix3NS
Here the parser will match both the element's local name (that is, entryElement) and the namespace. If we do not specify pref2 in the ns.map field, the parser uses only the prefix and the local name found in the xpath.expr expression when it does the matching. If pref2 is redefined in ns.map, the latest definition is used and any previous one is ignored.
xpath.expr: /root/container/p1:entryElement | /root/container/p2:entryElement
ns.map: p1=prefix2NS | p2=prefix3NS
In this case the prefixes are ignored and only the elements' local names and namespaces are considered.
The following expression:
xpath.expr: /$defPref:root/container/pref2:entryElement
ns.map: pref2=prefix3NS | $defPref=defaultNS
This has the same meaning as the second example configuration. However, the $defPref expression tells the parser that the root element belongs to the default namespace "defaultNS". This is useful when the default namespace has been predefined somewhere in the XML. In this example the parser matches only the local name and the namespace, but expects the XML element it checks to belong to the default namespace (that is, to have no prefix). In other words, this:
xpath.expr: /$defPref:root/$somePref:container/pref2:entryElement
ns.map: pref2=prefix3NS | $somePref=prefix1NS | $defPref=defaultNS
will not return any entryElement elements.
The XML Parser has the ability to navigate the XML tree using wildcards. The supported wildcard is the asterisk character - "*", which is used to replace the local name of an element of the XML. Let's say the following configuration is set:
xpath.expr: /root/container/*
This expression would retrieve each element under the element with local name "container", thus resulting in two iterations in total. The result would be the same if the xpath.expr is set to "/root/container/pref2:*" and the pref2 is not defined in the ns.map field.
The following configuration:
xpath.expr: /root/container/p1:*
ns.map: p1=prefix2NS
will retrieve all the elements that are under the container element and that belong to the prefix2NS namespace. In our case this is only the first child of the container element.
The following wildcard operations are not allowed: "*:localName", "local*", "pref:*Name", etc. The asterisk character replaces the local name of an element only.
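The matching rules described in this section can be sketched in Python. This is an illustration only, not the Parser's implementation, and all function names are hypothetical. Two simplifications are assumptions of the sketch: ElementTree discards the prefixes used in the document, so a prefix that is not mapped in ns.map is treated as unconstrained rather than matched literally, and the $defPref notation is not handled.

```python
import io
import xml.etree.ElementTree as ET

def parse_ns_map(ns_map):
    """Turn 'p1=nsA | p2=nsB' into {'p1': 'nsA', 'p2': 'nsB'}."""
    mapping = {}
    for pair in ns_map.split("|"):
        if pair.strip():
            prefix, uri = pair.split("=", 1)
            mapping[prefix.strip()] = uri.strip()  # a later definition wins
    return mapping

def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag

def step_matches(step, tag, ns):
    prefix, _, local = step.rpartition(":")
    uri, name = split_tag(tag)
    if local != "*" and local != name:
        return False
    # Simplification: an absent or unmapped prefix constrains nothing.
    return prefix not in ns or ns[prefix] == uri

def find_matches(xml_text, xpath_expr, ns_map=""):
    """Return the elements whose full path matches any of the simple paths."""
    ns = parse_ns_map(ns_map)
    paths = [p.strip().strip("/").split("/") for p in xpath_expr.split("|")]
    stack, found = [], []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        for steps in paths:
            if len(steps) == len(stack) and all(
                    step_matches(s, t, ns) for s, t in zip(steps, stack)):
                found.append(elem)
                break
        stack.pop()
    return found

SAMPLE = """<root xmlns="defaultNS" xmlns:pref1="prefix1NS">
  <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2">
    <pref2:entryElement><someData /></pref2:entryElement>
    <pref2:entryElement xmlns:pref2="prefix3NS"><moreData /></pref2:entryElement>
  </pref1:container>
</root>"""

only_second = find_matches(SAMPLE, "/root/container/pref2:entryElement", "pref2=prefix3NS")
```

Because the comparison is made on the namespace that a prefix maps to, the prefix names used in the expression do not have to be the ones used in the document.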
The main purpose of the Simple XPath (xpath.expr) parameter is to specify the place where the entry data should be put. By default this parameter is set to the single wildcard - "*". If the default value is not changed, the parser will output an XML document with a root element named DocRoot. You then have the choice to either remove the value of this parameter and have a multi-rooted document, or replace the asterisk with a concrete value.
For example if the following path is set:
xpath.expr: /root/container/entry | /otherRoot/otherContainer/moreElements
then the parser will create the structure:
<root>
  <container>
    <entry>
      /* The Entries go here. */
    </entry>
  </container>
</root>
Here the elements root, container and entry are static, since they do not belong to any entry passed to the parser as input. Depending on the configuration of the parser, these static elements are written either on each cycle (to wrap each entry) or once (to wrap all the entries).
Only the first path is used; the rest are ignored.
Using asterisks in the Simple XPath (xpath.expr) parameter when the XML Parser is in output mode will make the parser consider only the path before the first asterisk. For example the expression:
xpath.expr: /root/container/*/entry
will be considered as if you specified this expression:
xpath.expr: /root/container
The only expression that is an exception to the rule is:
xpath.expr: *
this will be read as if this was specified:
xpath.expr: DocRoot
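The output-mode rules above (first path only, truncation at the first asterisk, and the special case of a lone asterisk) can be sketched as a small Python function. The function name static_path is hypothetical and the code is an illustration of the rules as described, not the Parser's implementation.

```python
def static_path(xpath_expr, default_root="DocRoot"):
    """Return the static element names that xpath.expr yields in output mode."""
    # Only the first path is used; the rest are ignored.
    first = xpath_expr.split("|")[0].strip()
    # A lone asterisk is the one exception: it stands for the default root.
    if first == "*":
        return [default_root]
    steps = []
    for step in first.strip("/").split("/"):
        if "*" in step:
            break  # everything from the first asterisk onward is dropped
        if step:
            steps.append(step)
    return steps

wrapped = static_path("/root/container/entry | /otherRoot/otherContainer/moreElements")
```

An empty expression yields an empty list, which corresponds to the multi-rooted output described earlier.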
You could think of the xpath.expr as the parameter that configures
the root element(s) only. The parser has the ability to declare a
single element that will wrap each entry output as XML. This element
is configured in the entry.tag field. If this field has no value,
then no element wrapping each entry will be output. By default this
parameter has a value, so an additional element will be written to
the output stream. The parser also provides a convenient field to
configure the name of the element that will contain each value of a
simple multi-valued Attribute. This is configured in the value.tag
field, which by default is set to ValueTag; if it is removed and the
parser is asked to output such an Attribute, then each value will be
put in an element with the name "value".
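The roles of entry.tag and value.tag when writing can be sketched in Python. The function write_entries is hypothetical, entries are modelled as plain dictionaries, and at least one static root is assumed; this illustrates the wrapping behavior described above and is not TDI code.

```python
import xml.etree.ElementTree as ET

def write_entries(entries, roots=("DocRoot",), entry_tag="Entry", value_tag="ValueTag"):
    """Serialize dict-like entries under the static roots (at least one root)."""
    top = ET.Element(roots[0])
    parent = top
    for name in roots[1:]:
        parent = ET.SubElement(parent, name)
    for entry in entries:
        # With an empty entry.tag no wrapping element is written per entry.
        holder = ET.SubElement(parent, entry_tag) if entry_tag else parent
        for name, value in entry.items():
            attr_elem = ET.SubElement(holder, name)
            if isinstance(value, list):
                for item in value:
                    # With an empty value.tag the element name "value" is used.
                    ET.SubElement(attr_elem, value_tag or "value").text = str(item)
            else:
                attr_elem.text = str(value)
    return ET.tostring(top, encoding="unicode")

xml_out = write_entries(
    [{"telephoneNo": ["555-888-8888", "555-999-9999"], "User": "Jill Vox"}])
```

With the defaults, the result has the same shape as the simple-mode input example: DocRoot, then Entry, then one element per Attribute.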
Notes:
To declare attributes or prefixes on those static elements, use the
Static Attributes Declaration (static.decl) field. To output the XML
used in the "Advanced XML" section, use the following configuration:
From this example we can see that the static.decl uses
XML to mark up the attributes and the namespaces that need to be output
on the static roots. Note that the XML structure must match the resultant
XML structure. This field can also contain information about the Entry
Tag element.
If in the above example you add the parameter:
you
could then add some attributes/namespaces on that level as follows:
We could define both the entry.tag and value.tag to have prefixes, just like each of the xpath.expr path's elements could have.
The difference between the two is that the prefix for the value.tag
element must be defined prior to using it. This could be done using
the static.decl field or using the hierarchical entry structure provided
in this version. Currently it is not possible to include the value.tag
element in the static.decl field as the entry.tag is included.
Each time the parser is asked for an entry it reads data from the
InputStream and returns it as an Entry object. Each element found
in the XML is represented by the Attribute class, which implements
the org.w3c.dom.Element interface. Each attribute found
in the XML is represented by the Property class, which implements
the org.w3c.dom.Attr interface. Each CDATA section found in the
XML is represented by the AttributeValue class, which implements the org.w3c.dom.CDATASection interface.
The StAX implementation of the XLXP project supports neither DTD
nor XSD validation, so no validation is possible.
The XML Parser is able to read a multi-rooted XML document as long
as it does not have multiple XML declarations. If it has multiple
XML declarations, the XML can be read only if the Ignore
repeating XML declarations when reading check box is checked.
This, however, affects the performance of the parser, since the document
is double-checked for repeating XML declarations.
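To show why repeated declarations need special handling, here is a minimal Python sketch. It is not how the Parser works internally: the function name read_multi_rooted and the synthetic wrapper element are assumptions, and the whole text is held in memory for brevity. The sketch removes every XML declaration and then parses the remaining roots under one temporary wrapper.

```python
import re
import xml.etree.ElementTree as ET

# Matches '<?xml ... ?>' declarations, but not instructions such as
# '<?xml-stylesheet ... ?>', because whitespace must follow 'xml'.
XML_DECL = re.compile(r"<\?xml\s[^?]*\?>")

def read_multi_rooted(text):
    """Return the root elements of a multi-rooted document."""
    body = XML_DECL.sub("", text)
    wrapper = ET.fromstring("<wrapper>" + body + "</wrapper>")
    return list(wrapper)

roots = read_multi_rooted(
    '<?xml version="1.0"?><a><x/></a><?xml version="1.0"?><b/>')
```

The extra scan over the input for declarations is the kind of additional work that makes this option slower than plain reading.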
Notes:
A String representation of the retrieved Entry is available and
can be accessed using the getCurrentEntryAsXMLString() method.
Each time the parser is asked to write an Entry object in the output
stream it will call the toString() method on each object
that will be included in the XML (the same way the Simple XML Parser
behaves). Each entry will be flushed to the output stream separately, and in case of system failure the last flushed entry will have been
safely sent.
The parser has the ability to output each Entry in a separate root, by checking the Multi-rooted Document option. This will result
in a multi-rooted document where each entry has its own static root.
If you are appending the XML to an existing XML then it is useful
to check the No XML declaration when writing parameter. Checking
(enabling) this parameter will instruct the parser to omit the XML
declaration that is usually put in the beginning of an XML document.
The XML Parser has a Character Encoding parameter that
sets the name of the encoding. When set, the encoding
is used to decode the InputStream passed to the parser during
initialization. Whenever this parameter is not blank (an empty
string) it is used, regardless of its value. If, for example,
the InputStream is UTF-16BE encoded, has a Byte Order Mark (BOM) at
the beginning, and the Character Encoding parameter is set
to "UTF-16BE", then the parser will recognize the BOM sequence
and skip it automatically. If the Character Encoding parameter
is set to a different encoding (not compatible with the InputStream's
encoding), then an exception will be thrown indicating that
an inappropriate encoding is specified.
When you are not sure about the encoding of the InputStream or
file, you can let the parser discover it (if possible). This
is the order that the Parser follows to discover the encoding
of the XML if it is not explicitly specified in the configuration
(that is, the Character Encoding parameter is empty):
The parser does not
recognize unusual (reversed) four byte sequences similar to the UTF-32's
sequences. In this case an explicit configuration will be required
(using the Character Encoding parameter).
The XML declaration must be set on the first
line of the document and must start from the first character.
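Encoding discovery of this general kind can be sketched in Python. The Parser's actual discovery order is not reproduced here; the order below (BOM first, then the XML declaration) and the function name sniff_encoding are assumptions of the sketch. The declaration check only works for input whose first bytes are ASCII-compatible, and it requires the declaration to start at the first character.

```python
import codecs
import re

# UTF-32 marks are tested first: the UTF-32LE mark begins with the UTF-16LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]
_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def sniff_encoding(data):
    """Guess the encoding of XML bytes, or return None if nothing is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # match() anchors at offset 0, so the declaration must come first.
    found = _DECL.match(data)
    if found:
        return found.group(1).decode("ascii")
    return None

guess = sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
```

When nothing can be determined the sketch returns None, which corresponds to the case where an explicit Character Encoding setting is required.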
If the encoding is known at design time, we recommend setting
it explicitly in the XML Parser's configuration.
This improves the performance of the Parser's initialization
process because no lookup for an encoding is done.
When the parser is initialized for writing (Output Mode) then it
expects an explicit assignment of the Character Encoding parameter.
If no such assignment is done the Parser will use UTF-8 as a default
encoding (UTF-8 with no BOM sequence). If any BOM compatible encoding
is explicitly specified (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE)
then the parser will set a BOM sequence at the beginning of the stream.
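The output rule described above can be sketched in Python. The function name encode_output is hypothetical; the sketch only mirrors the stated behavior (no setting means UTF-8 without a BOM, an explicitly named BOM-compatible encoding gets a BOM) and is not the Parser's code.

```python
import codecs

_OUTPUT_BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16BE": codecs.BOM_UTF16_BE,
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-32BE": codecs.BOM_UTF32_BE,
    "UTF-32LE": codecs.BOM_UTF32_LE,
}

def encode_output(xml_text, encoding=""):
    """Encode XML text for output, adding a BOM only for an explicit setting."""
    if not encoding:
        # No explicit setting: UTF-8 with no BOM sequence.
        return xml_text.encode("utf-8")
    bom = _OUTPUT_BOMS.get(encoding.upper(), b"")
    return bom + xml_text.encode(encoding)

default_bytes = encode_output("<a/>")
```

Note the asymmetry: leaving the parameter empty and setting it to "UTF-8" produce different byte streams, because only the explicit setting writes the BOM.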
The example bundled in TDI_install_dir/examples/xmlparser2 demonstrates
how IBM TDI is able to work with various XML documents using the capabilities
of the XML Parser. Refer to the readme.txt file
for more information.
The configuration of the TDI XML Parser includes a parameter that
points to XSD schema(s). When the parser is asked for its schema it
reads the schema from the XSD and displays it. This, however, requires
that the parser is properly configured. If the navigation path is
not set, no schema can be retrieved. In that case you can
read an entry to discover a sample schema.
If no XSD is provided but the parser is properly configured (that
is, the navigation path is set) the Parser will try to extract the
Schema Location information from the XML. All schemas found will be
checked for the desired element's schema. If no schema is found within
the XML document then we will need to read an entry (that is, to
read part of the XML) in order to display the content of the returned
entry - which is default behavior for all schema querying.
However the returned entry's structure might not be the same as
another entry that is going to be read on the next cycle. This means
that the schema displayed cannot be guaranteed to be complete or valid.
To display the schema of the desired element(s), configure
the path to it (them) and the path to the corresponding schema(s)
(the schema path is optional; see No XSD provided). If there
are multiple elements and/or schema paths, then all paths should be
separated by a vertical bar - the "|" character. Regardless of whether
schema paths are entered, the Parser will check for schema locations
inside the XML document. The first schema extracted from the XML will
be chosen as the leading schema; if no schema is returned, then the first
schema configured by you will be the leading one.
Schemas can be declared with corresponding namespaces when they
are configured in the Parser configuration. The configuration is as
follows:
The namespace is used to determine which schema file a type in the
XSD Schema belongs to (if it is specified in the schema itself).
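Reading a schema configuration of the form "namespace1 schema1 | namespace2 schema2 | noNamespaceSchema" can be sketched as follows. The function name parse_schema_config and the path /path/Other.xsd are hypothetical, and the sketch assumes that neither namespaces nor paths contain spaces.

```python
def parse_schema_config(config):
    """Split the schema setting into (namespace, schema_path) pairs."""
    schemas = []
    for item in config.split("|"):
        parts = item.split()
        if len(parts) == 2:
            schemas.append((parts[0], parts[1]))
        elif len(parts) == 1:
            # A single token is a schema declared without a namespace.
            schemas.append((None, parts[0]))
    return schemas

pairs = parse_schema_config(
    "urn:nonstandard:XSD_Schema /path/Order.xsd | /path/Other.xsd")
```

The order of the pairs is preserved, which matters because the first configured schema can become the leading one.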
Benefits of the leading schema
This is the first schema checked for the elements
that you entered. If the information is not found in the leading schema,
then the other schemas entered by you or extracted from the XML are
checked. It is advisable to use as the leading schema the schema that
contains the root element of the XPaths that you have entered.
The library used for schema parsing is slow when it comes to creating
the XSD Schema. For this reason all paths are kept in a Map and when
a schema is needed for the first time then it is created and kept
in the Map in case it is needed later.
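The create-on-first-use caching described above is a common pattern and can be sketched in a few lines of Python. The class name SchemaCache and the loader callable are hypothetical; the loader stands in for whatever slow routine builds a schema object from a path.

```python
class SchemaCache:
    """Build each schema at most once, the first time it is requested."""

    def __init__(self, loader):
        self._loader = loader  # slow function: path -> schema object
        self._schemas = {}     # path -> schema object already built
        self.loads = 0         # number of times the loader actually ran

    def get(self, path):
        if path not in self._schemas:
            self._schemas[path] = self._loader(path)
            self.loads += 1
        return self._schemas[path]

cache = SchemaCache(lambda path: {"source": path})
first = cache.get("/path/Order.xsd")
second = cache.get("/path/Order.xsd")
```

The second request returns the stored object without running the loader again.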
The result
For each element entered in the element's path an Entry will be
created. Each entry will contain two attributes - Name and
Type. Name is the name of the element being queried.
Type is the corresponding type of the element found in the schema.
If the type found in the schema is a primitive type (that is, no definition
could be found for it in the provided schemas), then its name is the
value of the Type attribute. If the type found is not a
primitive one (that is, a definition for it was found in the provided
schemas), then the value of the Type attribute is a new Entry
whose attribute names are the names of all the elements and attributes
found, and whose values are the types of the corresponding attribute
or element. Again, if such a type is not primitive, a new Entry is
created and filled in in exactly the same manner. The result is
a tree-like structure.
When the schema of all elements is found the entries that are created
are put in a Vector object, and this object is returned as the result
of the schema querying.
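The recursive shape of this result can be sketched in Python with nested dictionaries in place of Entry objects and a list in place of the Vector. This is an illustration under simplifying assumptions, not the Parser's algorithm: the function names are hypothetical, only named complex types are expanded, inline anonymous types are not, and recursive type definitions are not guarded against.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

def describe_type(schema, type_name):
    """Return the type name for a primitive, or a dict of members otherwise."""
    if type_name is None:
        return None  # inline anonymous types are not expanded in this sketch
    for ctype in schema.findall(XSD + "complexType"):
        if ctype.get("name") == type_name:
            members = {}
            for child in ctype.iter(XSD + "element"):
                members[child.get("name")] = describe_type(schema, child.get("type"))
            for attr in ctype.iter(XSD + "attribute"):
                members[attr.get("name")] = attr.get("type")
            return members
    return type_name  # no definition found: treated as primitive

def query_schema(schema_text, element_names):
    """Return one {'Name', 'Type'} dict per requested element that is found."""
    schema = ET.fromstring(schema_text)
    results = []
    for name in element_names:
        for elem in schema.iter(XSD + "element"):
            if elem.get("name") == name:
                results.append(
                    {"Name": name, "Type": describe_type(schema, elem.get("type"))})
                break
    return results

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="products" type="Products" />
  <xsd:complexType name="Products"><xsd:sequence>
    <xsd:element name="product" type="Product" maxOccurs="unbounded" />
  </xsd:sequence></xsd:complexType>
  <xsd:complexType name="Product">
    <xsd:attribute name="id" type="xsd:long" use="required" />
    <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
  </xsd:complexType>
</xsd:schema>"""

result = query_schema(SCHEMA, ["products"])
```

Each level of nesting in the returned dictionaries corresponds to one non-primitive type that had to be expanded.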
This schema will be displayed flat in the current
Configuration Editor. This will be enough for test purposes of the
Query Schema functionality.
Indicators
Attributes
If a schema element contains any attributes they are kept in a
Map in a property "#attributes" which corresponds to the Entry in
which the schema of the element will be written.
Schema name: Order.xsd
Schema path: /path/Order.xsd
Contents of Order.xsd:
Configuring the Parser to display the User and Products schema:
Result
This is the toString() method representation of the
Vector returned as result. Everything is in one row.
xpath.expr: /root/pref1:container/
static.decl: <root xmlns="defaultNS" xmlns:pref1="prefixNS"> <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2" />
</root>
entry.tag: Entry
static.decl: <root xmlns="defaultNS" xmlns:pref1="prefixNS"> <pref1:container xmlns:pref2="prefix2NS" attribute1="attrValue1" pref1:attribute2="attrValue2">
<Entry xmlns="otherDefaultNS" pref1:attribute3="attrValue3" />
</pref1:container>
</root>
Reading XML
Writing XML
Character Encoding in the XML Parser
Example
Using XSD Schemas
Predefined XSD schema URI
No XSD provided
Configuring the Schema
namespace1 schema1 | namespace2 schema2 | noNamespaceSchema | ...
Example XSD Schema
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified"
    xmlns="urn:nonstandard:XSD_Schema"
    targetNamespace="urn:nonstandard:XSD_Schema" xmlns:stako="Stako">
  <xsd:element name="order" type="Order" />
  <xsd:complexType name="Order">
    <xsd:all>
      <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
      <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
    </xsd:all>
  </xsd:complexType>
  <xsd:complexType name="User">
    <xsd:all>
      <xsd:element type="xsd:string" name="deliveryAddress" />
      <xsd:element name="fullname">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:maxLength value="30" />
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:all>
  </xsd:complexType>
  <xsd:complexType name="Products">
    <xsd:sequence>
      <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Product">
    <xsd:attribute name="id" type="xsd:long" use="required" />
    <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
  </xsd:complexType>
</xsd:schema>
Simple XPath: /Order/User | /Order/Products
XSD Schema Location: /path/Order.xsd
[[Name:user, Type:[fullname:[xsd:string:[xsd:maxLength:30]], deliveryAddress:string]], [Name:products, Type:[product:[quantity:xsd:positiveInteger, id:xsd:long]]]]