UTF-8 and Unicode considerations when using message selectors

Characters that are not enclosed in single quotation marks and that make up the reserved keywords of a selection string must be entered in Basic Latin Unicode (U+0000 through U+007F). It is not valid to use other code point representations of alphanumeric characters. For example, the number 1 must be expressed as U+0031; it is not valid to use the fullwidth equivalent U+FF11 (FULLWIDTH DIGIT ONE) or the Arabic equivalent U+0661 (ARABIC-INDIC DIGIT ONE).
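The following Python sketch illustrates why this matters when selection strings are built programmatically. The helper name is_basic_latin is hypothetical and is not part of any product API. Generic "is this a digit" checks in many languages accept the fullwidth and Arabic-Indic digits, so a range check against U+007F is a safer test.

```python
def is_basic_latin(text: str) -> bool:
    """Return True if every character lies in Basic Latin (U+0000 through U+007F)."""
    return all(ord(ch) <= 0x7F for ch in text)


# The three representations of the digit one discussed above.
candidates = ["1", "\uFF11", "\u0661"]

for digit in candidates:
    # str.isdigit() is True for all three, so it cannot tell them apart;
    # only the Basic Latin check singles out U+0031.
    print(f"U+{ord(digit):04X} isdigit={digit.isdigit()} basic_latin={is_basic_latin(digit)}")
```

Only U+0031 passes the Basic Latin check, even though all three characters are classified as digits by Unicode.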

Message property names can be specified using any valid sequence of Unicode characters. Message property names contained within selection strings that are encoded in UTF-8 are validated even if they contain multi-byte characters. Validation of multi-byte UTF-8 is strict, so you must ensure that only valid UTF-8 sequences are used for message property names. Characters beyond the Unicode Basic Multilingual Plane (those above U+FFFF), which are represented in UTF-16 by surrogate pairs (code units X'D800' through X'DFFF') or in UTF-8 by four bytes, are not supported in message property names.
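An application can approximate these rules before it submits a selection string. The following Python sketch is illustrative only: the function name is_supported_property_name is hypothetical, and the check covers only the two rules stated above (well-formed UTF-8 and no characters above U+FFFF), not the complete validation that the product performs.

```python
def is_supported_property_name(raw: bytes) -> bool:
    """Approximate check: well-formed UTF-8 and no characters above U+FFFF."""
    try:
        # Strict decoding rejects malformed sequences, including
        # UTF-8 encoded surrogate code points.
        name = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Characters above U+FFFF need four bytes in UTF-8 and are not supported.
    return all(ord(ch) <= 0xFFFF for ch in name)


print(is_supported_property_name("größe".encode("utf-8")))       # two-byte characters
print(is_supported_property_name(b"\xc3\x28"))                   # malformed sequence
print(is_supported_property_name("\U0001F600".encode("utf-8")))  # beyond the BMP
```

The first name is accepted. The second is rejected because the byte sequence is not valid UTF-8, and the third is rejected because U+1F600 lies outside the Basic Multilingual Plane.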

No extra processing is performed on property names or values when comparing for equality. This means, for example, that no composition or decomposition takes place and that ligatures are not given any special meaning. For example, the pre-composed umlaut character U+00FC is not considered to be equivalent to the sequence U+0075 + U+0308, and the character sequence ff is not considered to be equivalent to U+FB00 (LATIN SMALL LIGATURE FF).
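The effect of a code point by code point comparison can be reproduced in Python, as in the sketch below. The helper normalized_equal is hypothetical. Normalizing strings in the application before setting properties and building selectors is one possible way to avoid mismatches; it is shown here as an assumption about application design, not as something the product does.

```python
import unicodedata


def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00FC"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS
ligature = "\uFB00"      # LATIN SMALL LIGATURE FF

# Without normalization the strings differ, which matches the
# selector behaviour described above.
print(precomposed == decomposed)                  # False
print(ligature == "ff")                           # False

# Canonical composition (NFC) unifies the umlaut forms; the ligature
# needs compatibility normalization (NFKC).
print(normalized_equal(precomposed, decomposed))  # True
print(normalized_equal(ligature, "ff", "NFKC"))   # True
```

If an application normalizes, the same form must be applied both to the property names and values it sets and to the selection strings it builds, because the comparison itself applies no normalization.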

Property data enclosed in single quotation marks can be represented by any sequence of bytes and is not validated.
