Simple XML Parser

IBM Tivoli Directory Integrator

The Simple XML Parser reads and writes XML documents that are at most two levels deep. It uses the Apache Xerces and Xalan libraries. The Parser gives access to the XML document through a script object called xmldom, which is an instance of the org.w3c.dom.Document interface. Refer to http://java.sun.com/xml/jaxp-1.0.1/docs/api/index.html for a complete description of this interface.

You can also use the XPathAPI (see http://xml.apache.org/xalan-j/apidocs/index.html), whose Java™ classes are accessible from your scripts, to search and select nodes from the XML document. selectNodeList, a convenience method in the system object, can be used to select a subset of the XML document.

When the Connector is initialized, the Simple XML Parser tries to perform Document Type Definition (DTD) verification if a DTD tag is present.

Use the Connector's override functions to interpret or generate the XML document yourself. Create the necessary script in either the Override GetNext or the GetNext Successful hook in your AssemblyLine's hook definitions. If you do not override, the Parser reads or writes a very simple XML document that mimics the entry object model. By default, the Parser only lets you read or write XML files two levels deep. It also reads multi-valued attributes, although only one of the values is shown when browsing the data in the Schema tab.

Note that certain methods, such as setAttribute, are available both in the IBM TDI entry and in the objects returned by xmldom.createElement. These methods have the same name or signature, so do not confuse the xmldom objects with the IBM TDI objects.

Notes:

This Parser was called "XML Parser" in pre-TDI 7.0 releases. In TDI 7.0 it was renamed to Simple XML Parser and a new XML Parser was added; see XML Parser. The new Parser has many improvements and is now the main TDI XML Parser.
If you read large (more than 4MB) or write large (more than 14MB) XML files, your Java VM may run out of memory. Refer to "Increasing the memory available to the Virtual Machine" in IBM TDI V7.1 Users Guide for a solution to this. Alternatively, use the XML Parser or the XML SAX Parser.
The Parser silently ignores empty entries.
When reading a CDATA attribute, no blank space is trimmed from the value. However, blank space is trimmed from attributes that are not CDATA.
Certain characters, such as $, are illegal in XML tags. Avoid these characters in your attribute names when using the XML Parser because these characters might create illegal XML.
When reading from an LDAP directory or an LDIF file, the distinguished name (DN) is typically returned in an attribute named $dn. If you map this attribute into an XML file without changing its name, the operation fails because $dn is not a legal tag in an XML document. If you do explicit mapping, change "$dn" to "dn" (or another name without special characters) in the output Connector. If you do implicit mapping, for example, with * or with Automatically map all attributes checked in the AssemblyLine Settings (through the Config... tab of the AssemblyLine), you can translate the distinguished name (for example, $dn) to a different name by script. For example, you can add something like this in the Before Add Hook of the output Connector:
```
conn.setAttribute("dn", work.getAttribute("$dn")); 
conn.removeAttribute("$dn");
```
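The same renaming can be generalized to any attribute name that contains characters that are illegal in XML tags. The following is only a sketch: toXmlTagName and renameForXml are hypothetical helpers, not part of the TDI API, and the set of allowed characters is a deliberately conservative ASCII subset of what XML permits.

```javascript
// Hypothetical helper (not a TDI function): derive a usable XML tag name
// from an attribute name by dropping characters outside a conservative
// ASCII subset of the characters XML allows in names.
function toXmlTagName(name) {
  var cleaned = String(name).replace(/[^A-Za-z0-9_.-]/g, "");
  // An XML name may not be empty or start with a digit, '.' or '-'.
  if (cleaned === "" || !/^[A-Za-z_]/.test(cleaned)) {
    cleaned = "_" + cleaned;
  }
  return cleaned;
}

// Hypothetical helper: apply toXmlTagName to every key of a plain object.
// A plain object stands in for an entry here so the sketch runs anywhere.
function renameForXml(attrs) {
  var out = {};
  for (var key in attrs) {
    if (Object.prototype.hasOwnProperty.call(attrs, key)) {
      out[toXmlTagName(key)] = attrs[key];
    }
  }
  return out;
}

var renamed = renameForXml({ "$dn": "cn=John Doe,o=ibm", "mail": "jdoe@example.com" });
```

Because characters are dropped, two different attribute names can end up with the same tag name (for example, $dn and dn). Check for such collisions before relying on this approach.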

Configuration

The Parser has the following parameters:

Root Tag

The root tag (output).

Entry Tag

The entry tag for entries (output).

Value Tag

The value tag for entry attributes (output).

Character Encoding

Character Encoding to be used. See Character Encoding in the Simple XML Parser.

Omit XML Declaration

If checked, the XML declaration is omitted in the output stream.

Document Validation

If checked, this parser requests a DTD/Schema-validating parser.

Namespace Aware

If checked, this parser requests a namespace-aware parser.

Indent Output

If this field is checked, then the output is indented.

If this text is to be processed by a program (and not meant for human interpretation) you most likely will want to deselect this parameter. This way, no unnecessary spaces or newlines will be inserted in the output.

Detailed Log

If this parameter is checked, more detailed log messages are generated.

Character Encoding in the Simple XML Parser

The default and recommended Character Encoding to use when deploying the Simple XML Parser is UTF-8. This will preserve data integrity of the XML data in most cases. When you are forced to use a different encoding, the Parser will handle the various encodings in the following way:

When reading a file, the Parser looks for the encoding in the following order:
1. If the TDI CharacterSet config parameter is set, the encoding is set to the value of this parameter. However, when the specified encoding is UTF-32 or UTF-16, check 2 is still attempted and, if it succeeds, its result overrides this value.
2. The XML is checked for an encoding attribute in the XML declaration. First, the XML is checked for a BOM. If one exists, the encoding indicated by the BOM is used to read the encoding attribute from the XML declaration; otherwise, the default encoding of the JRE is used to read it. If the encoding attribute is found, its value is used.
3. If the TDI CharacterSet parameter is not set and no encoding attribute is found in the XML declaration, the BOM encoding is used if a BOM is present.
4. If none of the above applies, the default encoding of the JRE is used.

On output, the Parser writes an XML header specifying the character encoding. This is the encoding specified in the Parser config; if nothing is specified there, UTF-8 is used.
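The lookup order used when reading can be summarized as a small decision function. This is a sketch of one reading of the rules above, not the Parser's actual implementation: the function name and its parameter names are hypothetical, and it assumes that the UTF-16/UTF-32 exception in step 1 refers to the value of the CharacterSet parameter.

```javascript
// Hypothetical sketch of the documented read-side lookup order.
// opts.configCharset : value of the TDI CharacterSet parameter, if set
// opts.declEncoding  : encoding attribute found in the XML declaration, if any
// opts.bomEncoding   : encoding indicated by a BOM, if one is present
// opts.jreDefault    : default encoding of the JRE
function resolveReadEncoding(opts) {
  var configCharset = opts.configCharset || null;
  var declEncoding = opts.declEncoding || null;
  var bomEncoding = opts.bomEncoding || null;

  if (configCharset) {
    // Step 1: the configured value wins, unless it is a UTF-16/UTF-32
    // variant and the XML declaration names an encoding (step 2).
    var isWide = /^UTF-?(16|32)/i.test(configCharset);
    return (isWide && declEncoding) ? declEncoding : configCharset;
  }
  if (declEncoding) {
    return declEncoding;   // Step 2
  }
  if (bomEncoding) {
    return bomEncoding;    // Step 3
  }
  return opts.jreDefault;  // Step 4
}

var chosen = resolveReadEncoding({
  declEncoding: "UTF-8",
  bomEncoding: "UTF-16LE",
  jreDefault: "Cp1252"
});
```

In this sketch the XML declaration takes precedence over the BOM whenever CharacterSet is not set, which matches steps 2 and 3.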

Examples

Override Add hook:

var root = xmldom.getDocumentElement();
var entry = xmldom.createElement ("entry");
var names = work.getAttributeNames();

// Write each attribute of the work entry as an <attribute name="..."> element
for ( var i = 0; i < names.length; i++ ) {
  var xmlNode = xmldom.createElement ("attribute");
  xmlNode.setAttribute ( "name", names[i] );
  xmlNode.appendChild ( xmldom.createTextNode ( work.getString( names[i] ) ) );
  entry.appendChild ( xmlNode );
}
root.appendChild ( entry );

After Selection hook:

//
// Set up variables for "override getnext" hook
//

var root = xmldom.getDocumentElement();
var list = system.selectNodeList ( root, "//Entry" );
var counter = 0;

Override GetNext hook:

//
// Note that the Iterator hooks are NOT called when we override the
// getnext function.
// Initialization is done in the After Selection hook.
//

var nxt = list.item ( counter );

if ( nxt != null ) {
   var ch = nxt.getFirstChild();
   while ( ch != null ) {
      var child = ch.getFirstChild();
      while ( child != null ) {
         // Use the grandchild's value if it exists, to be able to
         // read multi-valued attributes
         var grandchild = child.getFirstChild();
         var nodeValue;
         if ( grandchild != null )
            nodeValue = grandchild.getNodeValue();
         else
            nodeValue = child.getNodeValue();
         // Ignore strings containing newlines, they are just fillers
         if ( nodeValue != null && nodeValue.indexOf('\n') == -1 ) {
            work.addAttributeValue ( ch.getNodeName(), nodeValue );
         }
         child = child.getNextSibling();
      }
      ch = ch.getNextSibling();
   }

   result.setStatus (1); // Not end of input yet
   counter++;
} else {
   result.setStatus (0); // Signal end of input
}

The previous example parses files containing items that look like the following entries:

<DocRoot>
  <Entry>
    <firstName>John</firstName>
    <lastName>Doe</lastName>
    <title>Engineer</title>
  </Entry>
  <Entry>
    <firstName>Al</firstName>
    <lastName>Bundy</lastName>
    <title>Shoe salesman</title>
  </Entry>
</DocRoot>

Suppose instead that the input looks like the following entries:

<DocRoot>
  <Entry>
    <field name="firstName">John</field>
    <field name="lastName">Doe</field>
    <field name="title">Engineer</field>
  </Entry>
  <Entry>
    <field name="firstName">Al</field>
    <field name="lastName">Bundy</field>
    <field name="title">Shoe salesman</field>
  </Entry>
</DocRoot>

Here the attribute names can be retrieved from attributes of the field node, and this code is used in the Override GetNext Hook:

var nxt = list.item ( counter );

if ( nxt != null ) {
 var ch = nxt.getFirstChild();
 while ( ch != null ) {
  // Only <field> elements carry data; skip other children such as
  // whitespace text nodes
  if ( String(ch.getNodeName()) == "field" ) {
   // The attribute name is the value of the first XML attribute (name="...")
   var attrName = ch.getAttributes().item(0).getNodeValue();
   var nodeValue = ch.getFirstChild().getNodeValue();
   work.addAttributeValue ( attrName, nodeValue );
  }
  ch = ch.getNextSibling();
 }

 result.setStatus (1); // Not end of input yet
 counter++;
} else {
 result.setStatus (0); // Signal end of input
}
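The hook code above assumes that every field element has a name attribute in first position and a non-empty text value. A more defensive variant could look up the attribute by name and skip empty elements. The following is a minimal sketch under those assumptions: readFields and stubNode are hypothetical names, and stubNode exists only so the sketch can run outside TDI. Inside a hook you would pass a real org.w3c.dom node, such as the nxt variable, instead.

```javascript
// Hypothetical helper: collect name/value pairs from the <field name="...">
// children of one entry node. The node only needs the org.w3c.dom methods
// getFirstChild, getNextSibling, getNodeName, getAttribute and getNodeValue.
function readFields(entryNode) {
  var pairs = [];
  for (var ch = entryNode.getFirstChild(); ch != null; ch = ch.getNextSibling()) {
    if (String(ch.getNodeName()) !== "field") continue; // e.g. whitespace text nodes
    var name = String(ch.getAttribute("name"));          // "" when the attribute is absent
    var text = ch.getFirstChild();
    if (name === "" || text == null) continue;           // unnamed or empty <field/>
    pairs.push({ name: name, value: String(text.getNodeValue()) });
  }
  return pairs;
}

// Hypothetical stand-in for a DOM node, used only to exercise the sketch.
function stubNode(nodeName, attrs, value, children) {
  children = children || [];
  var node = {
    next: null,
    getNodeName: function () { return nodeName; },
    getAttribute: function (n) { return (attrs && attrs[n]) || ""; },
    getNodeValue: function () { return value; },
    getFirstChild: function () { return children.length ? children[0] : null; },
    getNextSibling: function () { return node.next; }
  };
  // Chain the children together as siblings.
  for (var i = 0; i + 1 < children.length; i++) {
    children[i].next = children[i + 1];
  }
  return node;
}

var entryStub = stubNode("Entry", null, null, [
  stubNode("#text", null, "\n    ", []),
  stubNode("field", { name: "firstName" }, null, [stubNode("#text", null, "John", [])]),
  stubNode("field", { name: "title" }, null, [])
]);
var pairs = readFields(entryStub);
```

In an Override GetNext hook, each returned pair could then be passed to work.addAttributeValue in the same way as in the example above.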

Additional Examples

Go to the root_directory/examples/simplexmlparser directory of your IBM TDI installation. This example package demonstrates how the base Simple XML Parser functionality can be extended to read XML more than two levels deep, by using the Override GetNext and Override Add hooks.