IBM TDI is written in Java™, which in turn uses Unicode (internally UTF-16, with 16-bit code units) for all strings and characters. When you work with strings and characters in AssemblyLines and Connectors, they are always assumed to be in Unicode. Most Connectors provide some means of specifying the Character Encoding to be used. When you read from text files on the local system, Java has already established a default Character Encoding conversion that depends on the platform you are running on.
The TDI Server has the -n command line option, which specifies the character set the Server uses when writing new Config files. The Server also embeds this character set designator in the file so that it can correctly interpret the file when reading it back in later.
Occasionally, however, you need to read or write text files whose contents use a different Character Encoding (for example, a file created on a machine running a different operating system). The Connectors that require a Parser usually accept a Character Set parameter in the Parser configuration. If set, this parameter must name one of the conversion tables available in the Java runtime, as governed by the IANA Charset Registry. If this parameter is not set, most Parsers use the local character set. Some Parsers might have specific default character sets. See information about individual Parsers in this guide.
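To illustrate why the Character Set parameter matters, the following is a minimal sketch in plain JavaScript that decodes the same bytes under two different encodings. The helper functions decodeLatin1 and decodeUtf8 are hypothetical and written only for this illustration; they are not part of TDI, and the UTF-8 decoder is deliberately simplified (1- to 3-byte sequences, no validation).

```javascript
// Decode bytes as ISO-8859-1: every byte maps directly to one code point.
function decodeLatin1(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] & 0xFF);
  }
  return out;
}

// Decode bytes as UTF-8. Simplified for illustration: handles only
// 1- to 3-byte sequences and performs no validation.
function decodeUtf8(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var b = bytes[i] & 0xFF;
    if (b < 0x80) {
      out += String.fromCharCode(b);
      i += 1;
    } else if (b < 0xE0) {
      out += String.fromCharCode(((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F));
      i += 2;
    } else {
      out += String.fromCharCode(
        ((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F));
      i += 3;
    }
  }
  return out;
}

// The word "café" encoded as UTF-8: the "é" occupies two bytes (C3 A9).
var sampleBytes = [0x63, 0x61, 0x66, 0xC3, 0xA9];

var asUtf8 = decodeUtf8(sampleBytes);     // four characters, ending in U+00E9
var asLatin1 = decodeLatin1(sampleBytes); // five characters, ending in U+00C3 U+00A9
```

Decoding with the wrong character set does not fail; it silently yields different characters, which is why the Parser's Character Set parameter should match the encoding the file was actually written in.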
Some files, when UTF-8, UTF-16 or UTF-32 encoded, may contain a Byte Order Mark (BOM) at the beginning of the file. The purpose of the BOM is to help identify the encoding that was used, so that the InputStream can be decoded to characters correctly. A BOM is the encoded form of the character U+FEFF, and its byte sequence differs from one encoding to another, so it can serve as a signature for the encoding used. The TDI File Connector does not recognize a BOM. Also, these TDI Parsers do not recognize a BOM:
If you try to read a file that starts with a BOM and the Parser does not know how to handle it, the returned data may be unusable. To avoid this, add the following code to, for example, the Before Selection Hook of the Connector:
var bom = thisConnector.connector.getParser().getReader().read(); // skip the BOM = 65279
if (bom != -1 && bom != 65279) {
    // make sure that we are skipping the BOM and not any other meaningful character
    throw "Invalid BOM";
}
This code reads and skips the BOM, provided that the correct character set has been specified for the Parser. The workaround is needed only if the Parser does not recognize or process the BOM, or if the BOM must be skipped for some other reason.
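The workaround above operates on decoded characters. As a complementary illustration at the byte level, the following minimal sketch shows how the leading bytes of a file can act as a signature for its encoding. The function detectBom and the table BOMS are hypothetical names introduced for this example and are not part of TDI; the byte sequences are the standard encodings of U+FEFF.

```javascript
// Known BOM byte sequences. The UTF-32 entries come first because the
// UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
// Note: UTF-16LE text whose first character after the BOM is U+0000
// would therefore be reported as UTF-32LE.
var BOMS = [
  { encoding: "UTF-32BE", bytes: [0x00, 0x00, 0xFE, 0xFF] },
  { encoding: "UTF-32LE", bytes: [0xFF, 0xFE, 0x00, 0x00] },
  { encoding: "UTF-8",    bytes: [0xEF, 0xBB, 0xBF] },
  { encoding: "UTF-16BE", bytes: [0xFE, 0xFF] },
  { encoding: "UTF-16LE", bytes: [0xFF, 0xFE] }
];

// Returns { encoding, length } for a recognized BOM, or null if the
// data does not start with one. "length" is the number of bytes to skip.
function detectBom(bytes) {
  for (var i = 0; i < BOMS.length; i++) {
    var bom = BOMS[i].bytes;
    var matches = bytes.length >= bom.length;
    for (var j = 0; matches && j < bom.length; j++) {
      // Mask with 0xFF so that signed byte values also compare correctly.
      matches = (bytes[j] & 0xFF) === bom[j];
    }
    if (matches) {
      return { encoding: BOMS[i].encoding, length: bom.length };
    }
  }
  return null;
}
```

A caller could pass the first four bytes of a file to detectBom and, if the result is not null, use the reported encoding as a hint for the Parser's Character Set parameter. Files without a BOM return null, in which case the encoding must be known by other means.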
Some care must be taken with the HTTP protocol; for details, see the section Character sets/Encoding in the description of the HTTP Parser.
For the list of registered character set names, refer to the IANA Charset Registry (http://www.iana.org/assignments/character-sets).
A common character set on Windows computers is CP850; for i5/OS a common value is IBM037.