Java2 uses Unicode as its internal character encoding: each Java char is a 16-bit UTF-16 code unit. When you work with strings and characters in AssemblyLines and Connectors, they are always assumed to be in Unicode. Most Connectors provide some means of character encoding conversion. When you read from text files on the local system, Java2 has already established a default character encoding conversion, which depends on the platform you are running on.
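To illustrate why the encoding only matters at the byte boundary, here is a minimal Node.js sketch (plain JavaScript, not TDI script): a string is Unicode internally, and a character set is applied only when converting to or from bytes.

```javascript
// A string is Unicode internally; an encoding is only applied when
// converting the string to or from a byte sequence.
const s = "café";

// The same string yields different byte sequences under different encodings.
const utf8 = Buffer.from(s, "utf8");     // é -> 0xC3 0xA9 (two bytes)
const latin1 = Buffer.from(s, "latin1"); // é -> 0xE9 (one byte)

console.log(utf8.length);   // 5
console.log(latin1.length); // 4

// Decoding bytes with the wrong character set garbles the text.
console.log(utf8.toString("latin1")); // "cafÃ©"
```

This is why a Connector or Parser must be told the correct character set: the bytes alone do not identify their encoding.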
The TDI server has the -n command-line option, which specifies the character set to use when writing new Config files; the server also embeds this character set designator in the file so that it can correctly interpret the file when reading it back in later.
However, you occasionally read or write text files whose contents use a different character encoding. For example, Connectors that require a Parser usually accept a Character Set parameter in the Parser configuration. This parameter must be set to one of the accepted conversion tables specified by the IANA Charset Registry (http://www.iana.org/assignments/character-sets).
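The effect of choosing a conversion table can be shown outside TDI with a minimal Node.js sketch (not TDI configuration), using an IANA-registered charset label such as "iso-8859-1" to decode raw bytes:

```javascript
// An IANA charset name selects the conversion table used to turn
// raw bytes into Unicode text.
const bytes = Buffer.from([0xE9]); // the single byte 0xE9 is "é" in ISO-8859-1

const latin1 = new TextDecoder("iso-8859-1").decode(bytes);
console.log(latin1); // "é"

// The same lone byte is invalid UTF-8, so it decodes to the
// Unicode replacement character U+FFFD.
const utf8 = new TextDecoder("utf-8").decode(bytes);
console.log(utf8); // "\uFFFD"
```

Setting the wrong Character Set parameter produces exactly this kind of silent corruption, which is why the parameter must match the actual encoding of the file.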
Some files encoded in UTF-8, UTF-16, or UTF-32 may begin with a Byte Order Mark (BOM): the encoded form of the character U+FEFF. The BOM can serve as a signature for the encoding used. However, the TDI File Connector does not recognize a BOM.
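Because U+FEFF encodes to a different byte sequence under each Unicode scheme, the BOM can be recognized by its leading bytes. The following is a minimal Node.js sketch (not TDI script; the function name detectBom is illustrative) of that signature check:

```javascript
// Identify a Unicode encoding from the BOM byte signature at the
// start of a buffer. The UTF-32 checks must come before the UTF-16
// checks, because a UTF-32LE BOM starts with the UTF-16LE bytes.
function detectBom(buf) {
  if (buf.length >= 4 && buf[0] === 0x00 && buf[1] === 0x00 &&
      buf[2] === 0xFE && buf[3] === 0xFF) return "UTF-32BE";
  if (buf.length >= 4 && buf[0] === 0xFF && buf[1] === 0xFE &&
      buf[2] === 0x00 && buf[3] === 0x00) return "UTF-32LE";
  if (buf.length >= 3 && buf[0] === 0xEF && buf[1] === 0xBB &&
      buf[2] === 0xBF) return "UTF-8";
  if (buf.length >= 2 && buf[0] === 0xFE && buf[1] === 0xFF) return "UTF-16BE";
  if (buf.length >= 2 && buf[0] === 0xFF && buf[1] === 0xFE) return "UTF-16LE";
  return null; // no BOM: the encoding must be known from elsewhere
}

console.log(detectBom(Buffer.from([0xEF, 0xBB, 0xBF, 0x68, 0x69]))); // "UTF-8"
```

Since the File Connector performs no such check, a BOM in the input arrives as an ordinary leading character and must be skipped explicitly, as shown below.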
If you need to read a file that starts with a BOM, you can skip it by adding code such as the following to, for example, the Before Selection Hook of the Connector:

var bom = thisConnector.connector.getParser().getReader().read(); // skip the BOM (U+FEFF = 65279)

This code reads and discards the BOM, assuming that you have specified the correct character set for the Parser.
Some care must be taken with the HTTP protocol; for more details, see the IBM Tivoli Directory Integrator V7.1 Reference Guide, in the section about character set encoding in the description of the HTTP Parser.
Parent topic: Parsers