INTERNET-DRAFT                                      Stuart Kwan
                                                    James Gilroy
                                                    Microsoft Corp.
                                                    July 2000
<draft-skwan-utf8-dns-04.txt>                       Expires January 2001


        Using the UTF-8 Character Set in the Domain Name System


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


Abstract

   The Domain Name System standard specifies that names are represented
   using the ASCII character encoding. This document expands that
   specification to allow the use of the UTF-8 character encoding, a
   superset of ASCII and a translation of the UCS-2 character encoding.


1. Introduction

   The Domain Name System standard [RFC1035] specifies that names are
   represented using the ASCII character encoding. This document
   expands that specification to allow the use of the UTF-8 character
   encoding [RFC2044], a superset of ASCII and a translation of the
   UCS-2 character encoding.

   Interpreting names as ASCII-only limits the utility of DNS in an
   international setting. The UTF-8 character set includes characters
   from most of the world's written languages, allowing a far greater
   range of possible names and allowing names to use characters that
   are relevant to a particular locality.
   UTF-8 is the recommended character set for protocols that are
   evolving beyond ASCII [RFC2130].

   This document defines the technology for a richer character set in
   DNS. This document specifically does not define policy for the
   characters allowed in a name when used in a particular application.
   For example, some protocols place restrictions on the characters
   allowed in a name. In addition, names that are intended to be
   globally visible [RFC1958] should contain ASCII-only characters per
   [RFC1123].


2. Protocol Description

   A UTF-8-aware DNS server is a DNS server that can load and store DNS
   names that contain UTF-8 characters. Names are encoded in logical
   order as opposed to visual order (see [UNICODE 2.0]).

   Uniform downcasing permits UTF-8-aware DNS implementations to
   interoperate with non-UTF-8-aware DNS implementations. Any binary
   string can be used in a DNS name [RFC2181], but names must be
   compared with case-insensitivity [RFC1035]. A non-UTF-8-aware DNS
   implementation is unable to perform a case-insensitive comparison on
   a name containing UTF-8 characters. However, if UTF-8 names are
   downcased before transmission, then binary comparisons will provide
   the desired result on non-UTF-8-aware servers without violating the
   case-insensitivity requirement.

   The DNS protocol standard states that original case should be
   preserved when possible as data is entered into the system. This
   requirement is modified as follows: a UTF-8-aware DNS server must
   downcase all names containing UTF-8 characters in both record names
   and record data before transmitting those names in any message. A
   UTF-8-aware DNS client/resolver must downcase all names containing
   UTF-8 characters before transmitting those names in any message.

   For consistency, UTF-8-aware DNS servers must compare names that
   contain UTF-8 characters byte-for-byte, as opposed to using Unicode
   equivalency rules.

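   The downcasing and comparison rules above can be illustrated with a
   minimal sketch. It is not part of the protocol definition. The
   function names are hypothetical, and the sketch assumes that "a name
   containing UTF-8 characters" means a name with at least one octet
   above 0x7F. It also assumes that Python's locale-independent
   str.lower() is an acceptable stand-in for the downcasing step; a
   real implementation may map some characters differently.

```python
def contains_non_ascii(name: bytes) -> bool:
    """True if any octet lies outside the US-ASCII range (assumed meaning
    of "contains UTF-8 characters" in this sketch)."""
    return any(octet > 0x7F for octet in name)


def downcase_for_wire(name: str) -> bytes:
    """Hypothetical helper: encode a name for transmission.

    ASCII-only names keep their original case, as RFC 1035 prefers.
    Names with non-ASCII characters are downcased before encoding.
    """
    encoded = name.encode("utf-8")
    if not contains_non_ascii(encoded):
        return encoded
    # str.lower() applies Unicode default case mappings, not the
    # process locale; this is an assumption of the sketch.
    return name.lower().encode("utf-8")


def names_equal(a: bytes, b: bytes) -> bool:
    """Hypothetical helper: compare two names as a UTF-8-aware server might.

    Names with non-ASCII octets are compared byte-for-byte, with no
    Unicode equivalence. ASCII-only names are compared
    case-insensitively.
    """
    if contains_non_ascii(a) or contains_non_ascii(b):
        return a == b
    # bytes.lower() only changes the ASCII letters A-Z.
    return a.lower() == b.lower()


if __name__ == "__main__":
    # Both spellings downcase to the same octets, so they match.
    upper = downcase_for_wire("B\u00dcCHER.Example")
    lower = downcase_for_wire("b\u00fccher.example")
    print(names_equal(upper, lower))

    # Precomposed and decomposed forms differ octet-for-octet, so a
    # byte-wise comparison treats them as different names.
    composed = "caf\u00e9.example".encode("utf-8")
    decomposed = "cafe\u0301.example".encode("utf-8")
    print(names_equal(composed, decomposed))
```

   Because the sender downcases the name, a non-UTF-8-aware server that
   compares the same octets with its ordinary ASCII case folding
   reaches the same answer as the byte-for-byte comparison.
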
   Applications should take care when allowing uppercase UTF-8
   characters to be passed to the resolver, and DNS servers should take
   care when allowing uppercase UTF-8 characters to be entered in zone
   data. Downcasing in UTF-8 is locale-sensitive and the result may
   vary according to the locale in which the code executes. The desired
   result will always be obtained if the application and server only
   accept lowercase characters.

   Names encoded in UTF-8 must not exceed the size limits clarified in
   [RFC2181]. Character count is insufficient to determine size, since
   some UTF-8 characters exceed one octet in length.


3. Interoperability Considerations

   The UTF-8 character encoding is ideal for use with existing protocol
   implementations that expect US-ASCII characters. The representation
   of a US-ASCII character in UTF-8 is byte-for-byte identical to the
   US-ASCII representation.

   Non-UTF-8-aware DNS clients always encode names in ASCII format and
   those names will always be correctly interpreted by a UTF-8-aware
   DNS server.

   DNS server authors may wish to provide a configuration switch on the
   DNS server to allow/disallow the use of UTF-8 characters on a
   per-server or per-zone basis.

   A non-UTF-8-aware DNS server may accept a zone transfer of a zone
   containing UTF-8 names, but it may not be able to write back those
   names to a zone file or reload those names from a zone file.
   Administrators should exercise caution when transferring a zone
   containing UTF-8 names to a non-UTF-8-aware DNS server.


4. Security Considerations

   The choice of character encoding for names does not impact the
   security of the DNS protocol.


5. Acknowledgements

   The authors of this document would like to thank the following
   people for their contribution to this specification: John McConnell,
   Cliff Van Dyke and Bjorn Rettig.


6. References

   [RFC1035]      P.V. Mockapetris, "Domain Names - Implementation and
                  Specification," RFC 1035, ISI, November 1987.

   [RFC2044]      F. Yergeau, "UTF-8, a transformation format of Unicode
                  and ISO 10646," RFC 2044, Alis Technologies, October
                  1996.

   [RFC1958]      B. Carpenter, "Architectural Principles of the
                  Internet," RFC 1958, IAB, June 1996.

   [RFC1123]      R. Braden, "Requirements for Internet Hosts -
                  Application and Support," STD 3, RFC 1123, October
                  1989.

   [RFC2130]      C. Weider et al., "The Report of the IAB Character Set
                  Workshop held 29 February - 1 March 1996," RFC 2130,
                  April 1997.

   [RFC2181]      R. Elz and R. Bush, "Clarifications to the DNS
                  Specification," RFC 2181, University of Melbourne and
                  RGnet Inc, July 1997.

   [UNICODE 2.0]  The Unicode Consortium, "The Unicode Standard, Version
                  2.0," Addison-Wesley, 1996. ISBN 0-201-48345-9.


7. Authors' Addresses

   Stuart Kwan                      James Gilroy
   Microsoft Corporation            Microsoft Corporation
   One Microsoft Way                One Microsoft Way
   Redmond, WA 98052                Redmond, WA 98052
   USA                              USA
   <skwan@microsoft.com>            <jamesg@microsoft.com>