IETF IDN Working Group                Seungik Lee, Hyewon Shin, Dongman Lee
Internet Draft                                                          ICU
draft-ietf-idn-icu-00.txt                          Eunyong Park, Sungil Kim
Expires: 14 January 2001                                    KKU, Netpia.com
                                                               14 July 2000

Architecture of Internationalized Domain Name System

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

Because the Domain Name System (DNS) restricts domain names to alphanumeric characters, a way is needed to find an Internet host using multi-lingual domain names: the Internationalized Domain Name System (IDNS). This document describes, from an architectural point of view, how multi-lingual domain names are handled in a new protocol scheme for IDNS servers and resolvers. It updates [RFC1035] while preserving backward compatibility with the current DNS protocol.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

"IDNS" (Internationalized Domain Name System) is used here to indicate a new system designed for a domain name service that supports multi-lingual domain names.

"The current/conventional DNS" or "DNS" (Domain Name System) is used here to indicate the domain name systems currently in use. These systems conform to [RFC1034] and [RFC1035], but their implementations and functional operations may differ from one another.

The term "alphanumeric" is used here for the character set that is allowed for a domain name in the DNS query format, [a-zA-Z0-9-].

3. Introduction

The Domain Name System (DNS) has eliminated the difficulty of remembering IP addresses. As the Internet spreads to more people, it becomes more likely that people who are not familiar with alphanumeric characters will use it. Domain names in alphanumeric characters are difficult to remember or use for people who have not learned English. Therefore, a way is needed to find an Internet host using a multi-lingual domain name: the Internationalized Domain Name System.

3.1 The current issues of IDNS

IDNS maps a name to an IP address as the typical DNS does, but it allows domain names to contain multi-lingual characters. The multi-lingual characters need to be encoded into and decoded from one standardized format, which requires changes to the conventional DNS protocol described in [RFC1034] and [RFC1035]. The changes to the present DNS protocol must nevertheless be kept to a minimum so that backward compatibility is guaranteed.

The IDNS issues have been discussed in the IETF IDN Working Group and are well described in [IDN-REQ]. The main issues are:

- Compatibility and interoperability.
  The DNS protocol is widely used in the Internet. Even after a new protocol is introduced for DNS, the current protocol may continue to be used unchanged. Therefore IDNS, as a new design for the DNS protocol, must provide backward compatibility and interoperability with the current DNS.

- Internationalization.
  The purpose of IDNS is to allow multi-lingual domain names. International character data must be represented in one standardized format in domain names.

- Canonicalization.
  DNS indexes and matches domain names in order to look up a domain name in zone data. In the conventional DNS, canonicalization applies to US-ASCII only. Multi-lingual character data, however, must be canonicalized by its own rules to support a standardized DNS matching policy, e.g. a case-insensitive matching rule.

- Operational issues.
  IDNS uses international character data for domain names. Normalization and canonicalization of domain names are needed in addition to the current DNS operations. IDNS also needs an operation for interoperability with the current DNS. Therefore, operational guidelines for IDNS need to be specified.

3.2 Overview of the proposed scheme

Our proposed scheme for IDNS addresses the issues described above in order to fulfill the IDN requirements [IDN-REQ]. The proposed scheme can be summarized as follows:

- The IN bit, which is reserved and currently unused in the DNS query/response format header, is used to distinguish queries generated by IDNS servers or resolvers from those generated by non-IDNS ones [Oscarsson]. This mechanism is also needed to indicate whether or not the query has been through the appropriate IDNS operations for canonicalization and normalization.

- Multi-lingual domain names are encoded in UTF-8 as the wire format. [RFC2130] recommends UTF-8 as the default character encoding scheme (CES) for new protocols that transmit text. This scheme allows an IDNS server to handle DNS queries from non-IDNS servers or resolvers, because ASCII is unchanged in UTF-8.

- UTF-8 domain names must be case-folded before transmission. This minimizes the server's overhead for case-insensitive name matching. It also guarantees that cached query results can be used without any further normalization or canonicalization. If an IDNS server receives a non-IDNS query that is not case-folded, it case-folds the query before transmitting it to other servers.

4. Design considerations

Our proposed scheme is designed to fulfill the requirements of the IETF IDN WG [IDN-REQ]. All methods used in IDNS schemes must be consistent with the requirements document, and the design described in this document is based on these requirements.

4.1 Protocol Extensions

To indicate an IDNS query format, we use an unallocated bit in the current DNS query format header, named the 'IN' bit [Oscarsson]. All IDNS queries have the IN bit set to 1. Without this bit set to 1, we cannot guarantee that the query is in the appropriate format for IDNS.

The 'IN' bit indicates whether or not the query comes from an IDNS resolver or server. It also reduces the canonicalization overhead at the IDNS server, as described further in <4.4 Canonicalization>.

We devise new operations and new structures for resolvers and name servers to add multi-lingual domain name handling to the DNS. All DNS servers and resolvers that are to use multi-lingual domain names must therefore be changed. The new architectures for resolvers and servers are described in <5. Architectures>.

4.2 Compatibility and interoperability

The 'IN' bit occupies a valid bit position in the conventional DNS query header, one that [RFC1035] requires to be set to zero. The operations and structures of IDNS also preserve the conventional rules of DNS, which guarantees interoperability with conventional DNS servers and resolvers and makes the changes optional. Together these make this IDNS scheme compatible with the current protocol.

Although the current DNS protocol uses 7-bit ASCII characters only, its query format is 8-bit clean. Therefore, we can guarantee backward compatibility and interoperability with the current DNS when using UTF-8, because ASCII is preserved unchanged in UTF-8.

Note: Some implementations already in use are compatible with the current DNS but extend their operations to use UTF-8 domain names.
The IDNS described here interoperates well with these implementations. This interoperability is described in <5.4 Interoperability with the current DNS>.

4.3 Internationalization

All international character data must be represented in one standardized format, and that format must be compatible with the current ASCII-based protocols. Therefore, the coded character set (CCS) for the IDNS protocol must be Unicode [UNICODE], encoded using the UTF-8 [RFC2279] character encoding scheme (CES).

The client-side interface may accept domain names encoded in any local character set, Unicode, ASCII and so on, but they must be converted to Unicode before being passed to the IDNS resolver. The IDNS resolver accepts Unicode character data only, and finally converts it to UTF-8 for transmission.

4.4 Canonicalization

In the current DNS protocol, domain names are matched case-insensitively. Therefore, the domain names in a query and in the zone file must be case-folded before the equivalence test.

The case-folding issue has been discussed for a long time in the IETF IDN WG. The main problem is that case folding is locale-dependent: the same character may fold to different characters in different locales. For example, Latin capital letter I (U+0049) case-folded to lower case in the Turkish context becomes Latin small letter dotless i (U+0131), but in the English context it becomes Latin small letter i (U+0069).

Therefore, our new IDNS design case-folds domain names in a locale-independent way, using the method defined in [UTR21].

Multi-lingual domain names should be case-folded in IDNS resolvers or IDNS servers before being transmitted to other IDNS/DNS servers. That is, an IDNS resolver should case-fold the domain name and convert it to UTF-8 before transmission. An IDNS server that receives a query with the IN bit set to 1 does not need to canonicalize the multi-lingual domain name again.
If the IDNS server receives a query with the IN bit set to 0, it cannot determine whether the query is in the appropriate canonicalized format, so it case-folds the multi-lingual domain name in the query and sets the 'IN' bit to 1.

Current DNS queries carry domain names in their original case so that the case is preserved. To be consistent with this rule, IDNS resolvers and servers should store each multi-lingual domain name before case-folding it, and should restore the original form before sending the response.

To case-fold UTF-8 data with the method in [UTR21], the UTF-8 should first be converted to Unicode and then looked up in the mapping table. If a case-folding mapping table for UTF-8 character data were available, this overhead could of course be reduced.

Even so, some canonicalization overhead in IDNS servers cannot be avoided, because the canonicalization of international character data is complicated. To minimize this overhead, we use the 'IN' bit to indicate that canonicalization of the query has already been performed, so that no further canonicalization operation is needed. The detailed operations that depend on the 'IN' bit are described later in <5. Architectures>.

With international character data, canonicalization (e.g. case-folding) is much more complicated than with US-ASCII, and it differs from one locale context to another. This document does not specify any method or recommendation beyond case-folding. For canonicalization of international character data, [UTR15] is a good start. It must be discussed further and specified in the IDNS protocol specification.

4.5 Operational issues

The current DNS scheme uses only ASCII as its wire format, whereas our new IDNS scheme uses UTF-8. All IDNS resolvers must transmit queries that are encoded in UTF-8 and case-folded.
This format can be guaranteed by checking the IN bit: if the IN bit is set to 1, the query is encoded in UTF-8 and case-folded. Otherwise the IDNS server cannot be sure that the query is encoded in UTF-8 and case-folded, and it needs additional operations such as encoding to UTF-8 and case-folding.

Current DNS resolvers transmit queries in ASCII. This needs no special handling in IDNS servers, because ASCII is preserved unchanged in UTF-8.

Some applications and resolvers, e.g. Microsoft's DNS servers, transmit queries in UTF-8 although they do not follow the new IDNS resolver structure. We cannot guarantee that those queries are case-folded correctly. Therefore, in that case the IDNS server, rather than an IDNS resolver, should convert them to appropriate IDNS queries.

All detailed operations of IDNS servers and resolvers are described in <5. Architectures>.

5. Architectures

5.1 New header format

New IDNS servers and resolvers must interoperate with those of the current DNS. Therefore, we need a way to determine whether a query is an IDNS query or not. For this reason, we use a new header format as proposed in [Oscarsson].

                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

IDNS resolvers and servers identify themselves in a query or a response by setting the 'IN' bit to 1 in the DNS query/response format header. This bit is defined to be zero by default in current DNS servers and resolvers.
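
As an illustration of the header format in 5.1, the following is a minimal Python sketch of setting and testing the 'IN' bit. It is not part of the proposal. The names IN_MASK, pack_header, set_in_bit and has_in_bit are hypothetical, and the sketch assumes that the 'IN' bit is bit 9 of the second 16-bit header word, counting bit 0 as the most significant bit as in the header diagram, which gives the mask 0x0040.

```python
import struct

# Assumption: 'IN' is bit 9 of the flags word (bit 0 = QR = most significant),
# so its mask is 1 << (15 - 9) = 0x0040.
IN_MASK = 0x0040


def pack_header(msg_id, flags, qdcount=1, ancount=0, nscount=0, arcount=0):
    """Build the 12-octet DNS header in network byte order."""
    return struct.pack("!6H", msg_id, flags, qdcount, ancount, nscount, arcount)


def set_in_bit(message: bytes) -> bytes:
    """Return a copy of the message with the 'IN' bit set to 1."""
    msg_id, flags, *counts = struct.unpack("!6H", message[:12])
    return struct.pack("!6H", msg_id, flags | IN_MASK, *counts) + message[12:]


def has_in_bit(message: bytes) -> bool:
    """Report whether the 'IN' bit is set in the message header."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & IN_MASK)


if __name__ == "__main__":
    query = pack_header(0x1234, 0x0100)   # a query with only RD set
    print(has_in_bit(query))              # a conventional DNS query
    print(has_in_bit(set_in_bit(query)))  # the same query marked as IDNS
```

Only the flags word is modified; the ID, the counts, and any octets after the header are copied through unchanged, so a conventional DNS query remains otherwise intact.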

5.2 Structures of IDNS resolvers

To use multi-lingual domain names with IDNS servers, all IDNS/DNS resolvers must generate queries in UTF-8 or ASCII. Resolver designs may differ from one another depending on the local operating system or application. We propose new resolver design guidelines as a basis for standardization.

The IDNS resolver accepts domain names in Unicode from the user interface. Other character sets should be rejected. It encodes all such character data into UTF-8 for transmission to name servers.

The operation of an IDNS resolver proceeds as follows:

<1>. If the resolver receives a domain name in Unicode or ASCII, it stores the original domain name query. Otherwise the lookup request is rejected. In the current DNS protocol, the original case of the domain name should be preserved. Therefore, the resolver must store the original case of the domain name before canonicalization (e.g. case-folding).

<2>. Case-fold the domain name with the locale-independent case-mapping table defined in [UTR21].

<3>. Convert it to UTF-8.

<4>. Set the IN bit to 1. This indicates that the query comes from an IDNS resolver and that its format is UTF-8, case-folded.

<5>. Send the request query to name servers.

<6>. Restore the original domain name query in the response.

<7>. Send the response to the application.

5.3 Structures of IDNS servers

The operation of an IDNS server is similar to that of a current DNS server, but the IDNS server additionally accepts UTF-8 queries and converts them to the appropriate format. The IDNS server distinguishes IDNS queries from DNS queries by checking the IN bit in the query/response format header, and it operates differently depending on the 'IN' bit.

The operation of an IDNS server proceeds as follows:

<1>. If the IN bit in the query/response format header is set to 1, the server matches the domain name against zone file data or forwards the request query for resolution.
     It operates in the same way as a current DNS server, except that it handles UTF-8 data. In this case it does not need to canonicalize the domain name, because the domain name was already canonicalized by an IDNS resolver or IDNS server earlier in the path. Go to step <7>.

<2>. Set the IN bit to 1.

<3>. Store the original domain name query.

<4>. Case-fold the domain name with the locale-independent case-mapping table defined in [UTR21].

<5>. Match the domain name against zone file data or send a request query for lookup.

<6>. Restore the original domain name query in the response.

<7>. Send the response for the query to the resolver or other server that requested it.

5.4 Interoperability with the current DNS

DNS servers and resolvers accept domain names in ASCII only, whereas IDNS servers and resolvers accept domain names in UTF-8. Queries from DNS servers and resolvers to IDNS ones can therefore be handled well, because UTF-8 is a superset of ASCII. Queries from IDNS servers and resolvers to DNS ones, however, will be rejected, because their UTF-8 data lies beyond the range of ASCII.

Note: Some implementations, e.g. Microsoft's DNS server and resolvers, can handle UTF-8 domain names although they do not follow this IDNS specification and fully implement only the DNS protocol specification. In this case, we cannot guarantee that queries from these third-party implementations are encoded in UTF-8 and properly canonicalized. These queries, however, have the 'IN' bit set to 0, so the IDNS server evaluates whether or not the domain name is within the range of UTF-8, converts it to UTF-8, and finally canonicalizes it.

6. Security Considerations

This IDNS architecture uses 8-bit-clean queries for transmission and handles UTF-8 instead of ASCII. The DNS protocol already allocates an 8-bit query format for domain names. Therefore, the IDNS protocol inherits the security issues of the current DNS.

Canonicalization for IDNS is defined in [UTR15] and case folding in [UTR21]. All security issues related to canonicalization or normalization are inherited from those described in [UTR15] and [UTR21].

As always with data, if software does not check for data that can be a problem, security may be affected. Because more characters than ASCII are allowed, software that expects only ASCII and performs no checks may now have security problems.

7. References

[IDN-REQ]   James Seng, "Requirements of Internationalized Domain Names," Internet Draft, June 2000

[KWAN]      Stuart Kwan, "Using the UTF-8 Character Set in the Domain Name System," Internet Draft, February 2000

[Oscarsson] Dan Oscarsson, "Internationalisation of the Domain Name Service," Internet Draft, February 2000

[RFC1034]   Mockapetris, P., "Domain Names - Concepts and Facilities," STD 13, RFC 1034, USC/ISI, November 1987

[RFC1035]   Mockapetris, P., "Domain Names - Implementation and Specification," STD 13, RFC 1035, USC/ISI, November 1987

[RFC2119]   S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels," RFC 2119, March 1997

[RFC2130]   C. Weider et al., "The Report of the IAB Character Set Workshop held 29 February - 1 March 1996," RFC 2130, April 1997

[RFC2279]   F. Yergeau, "UTF-8, a transformation format of ISO 10646," RFC 2279, January 1998

[RFC2535]   D. Eastlake, "Domain Name System Security Extensions," RFC 2535, March 1999

[UNICODE]   The Unicode Consortium, "The Unicode Standard - Version 3.0," http://www.unicode.org/unicode/

[UTR15]     M. Davis and M. Duerst, "Unicode Normalization Forms," Unicode Technical Report #15, November 1999, http://www.unicode.org/unicode/reports/tr15/

[UTR21]     Mark Davis, "Case Mappings," Unicode Technical Report #21, May 2000, http://www.unicode.org/unicode/reports/tr21

8. Acknowledgments

Kyoungseok Kim <gimgs@asadal.cs.pusan.ac.kr>
Chinhyun Bae <piano@netpia.com>

9. Authors' Addresses

Seungik Lee
Email: silee@icu.ac.kr

Hyewon Shin
Email: hwshin@icu.ac.kr

Dongman Lee
Email: dlee@icu.ac.kr

Information & Communications University
58-4 Whaam-dong Yuseong-gu
Taejon, 305-348
Korea

Eunyong Park
Email: eunyong@eunyong.pe.kr

Konkuk University
93-1 Mojindong, Kwangjin-ku
Seoul, 143-701
Korea

Sungil Kim
Email: clicky@netpia.com

Netpia.com
35-1 8-ga Youngdeungpo-dong Youngdeungpo-gu
Seoul, 150-038
Korea