draft-ietf-idn-nameprep-00.txt

     
Internet Draft                                          Paul Hoffman
draft-ietf-idn-nameprep-00.txt                            IMC & VPNC
July 3, 2000                                           Marc Blanchet
Expires in six months                                       ViaGenie

             Preparation of Internationalized Host Names

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."


     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


Abstract

This document describes how to prepare internationalized host names for
transmission on the wire. The steps include excluding characters that
are prohibited from appearing in internationalized host names, changing
all characters that have case properties to be lowercase, and
normalizing the characters. Further, this document lists the prohibited
characters.


1. Introduction

When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The IDN
Working Group's requirements document [IDNReq] describes a framework for
domain name handling as well as requirements for the new names. The IDN
Working Group's comparison document [IDNComp] gives a framework for how
various parts of the IDN solution work together.

A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in
internationalized host names. Thus, there must be a way to canonicalized
the user's input before the name is resolved in the DNS.

It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct.
This means that the user should not be limited to only entering exactly
the characters that might have been used, but to instead be able to
enter characters that unambiguously canonicalize to characters in the
desired host name. At the same time, this process must not introduce any
chance that two host names could be represented by two distinct strings
of characters that look identical to typical users. It is also a design
goal to have all preprocessing of IDN done before going on the wire, so
that no transformation is done in the DNS server space.

This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A". In the lists of prohibited characters, the "U+" is left off to make
the lists easier to read.

1.2 IDN summary

Using the terminology in [IDNComp], this document specifies all of the
prohibited characters and the canonicalization for an IDN solution.
Specifically, it covers the following sections from [IDNComp]:

prohib-1: Identical and near-identical characters
prohib-2: Separators
prohib-3: Non-displaying and non-spacing characters
prohib-4: Private use characters
prohib-5: Punctuation
prohib-6: Symbols
canon-1.2: Normalization Form KC
canon-2.1: Case folding in ASCII
canon-2.2: Case folding in non-ASCII

Note that this document does not cover:
canon-1.1: Normalization Form C
canon-2.3: Han folding

1.3 Open issues

This is the first draft of this document. Although there has been much
discussion on the WG mailing list about the topics here, there has not
yet been much agreement on some issues. Now that there is a document to
talk about, that discussion can be more focussed.

1.3.1 Where to do name preparation

Section 2.1 says to do name preparation in the resolver. An argument can
be made for doing name preparation in the application, before the
application service interface. An advantage of that proposal is that
resolvers would not need to do any name preparation. A disadvantage is
that applications would have to be updated each time the IDN protocol is
updated, such as if new characters are added to the repertoire of
allowed characters. It seems likely that resolvers are more easily
updated than all the individual applications that use internationalized
host names.

1.3.2 Choosing between normalization form C and KC

Much of the discussion of normalization on the WG mailing list assumed
that normalization form C would be used. Near the time that this
document was written, people started considering form KC instead of C.
This document used form KC, but the reasons for doing so could be
contentious.

1.3.3 Does the prohibition catch all bad characters?

On the mailing list, it was discussed doing prohibition in two steps: a
short list of prohibited characters before case folding in order to
prevent uppercase characters that have no lowercase equivalents from
getting through, and then a full check on the output of normalization.
In this draft, all checking is done before case folding, based on the
(possibly wrong) assumption that none of the prohibited characters will
re-appear after the case folding and normalization. If that assumption
turns out to be wrong, a check for just those problematic characters can
be added after normalization, or a full check against the prohibited
characters can be added.


2. Preparation Overview

This section describes where name preparation happens and the steps that
name preparation software must take.

2.1 Where name preparation happens

Part of the chart in section 1.4 of [IDNReq] looks like this:

+---------------+
| Application   |
+---------------+
      |  Application service interface
      |  For ex. GethostbyXXXX interface
+---------------+
| Resolver      |
+---------------+
      |     <-----   DNS service interface
+-------------------------------------------+
 
In this specification, the name preparation is done in the resolver,
before the DNS service interface. That is, it is acceptable for software
in the application service interface (such as a "GetHostByName" API) to
pass the resolver a name that has not been prepared. However, the
resolver MUST prepare the name as described in this specification before
passing it to the DNS service interface.

2.2 Name preparation steps

The steps for preparing names are:

1) Input from the application service interface -- This can be done in
many ways and is not specified in this document

2) Look for prohibited input -- Check for any characters that are not
allowed in the input. If any are found, return an error to the
application service interface. This step is necessary to prevent errors
in the following two steps. This step fulfills prohib-1, prohib-2,
prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp].

3) Fold case -- Change all uppercase characters into lowercase
characters. Design note: this step could just as easily have been
"change all lowercase characters into uppercase characters". However,
the upper-to-lower folding was chosen because most users of the Internet
today enter host names in lowercase. This step fulfills canon-2.1 and
canon-2.2 from [IDNComp].

4) Canonicalize -- Normalize the characters. This step fulfils canon-1.2
from [IDNComp].

5) Resolution of the prepared name -- This must be specified in a
different IDN document.

The above steps MUST be performed in the order given in order to comply
with this specification.


3. Prohibited Input

Before the text can be processed, it must be checked for prohibited
characters. There is a variety of prohibited characters, as described in
this section.

Note that one of the goals of IDN is to allow the widest possible set of
host names as long as those host names do not cause other problems, such
as possible ambiguity. Specifically, experience with current DNS names
have shown that there is a desire for host names that include personal
names, company names, and spoken phrases. A goal of this section is to
prohibit as few characters that might be used in these contexts as
possible while making sure that characters that might easily cause
confusion or ambiguity are prohibited.

Note that every character listed in this section MUST NOT be transmitted
on the DNS service interface. Although the checking is being performed
before case folding and canonicalization, those steps cannot result in
any of these characters if these characters are not in the input stream.
[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS
server receives a request containing a prohibited character, then the
IDN protocol MUST return an error message.


Note that some characters listed in one section would also appear in
other sections. Each character is only listed once.

3.1 prohib-1: Identical and near-identical characters

Many characters in [ISO10646] are identical or nearly identical to other
characters. These were often included for compatibility with other
character sets.

The characters prohibited because they are identical or nearly identical
to allowed characters are:

00AD        SOFT HYPHEN
00D7        MULTIPLICATION SIGN
01C3        LATIN LETTER RETROFLEX CLICK
02B0-02FF   [SPACING MODIFIER LETTERS]
066D        ARABIC FIVE POINTED STAR
1806        MONGOLIAN TODO SOFT HYPHEN
2010        HYPHEN
2011        NON-BREAKING HYPHEN
2012        FIGURE DASH
2013        EN DASH
2014        EM DASH
2160-217F   [ROMAN NUMERALS]
FB1D-FB4F   [HEBREW PRESENTATION FORMS]
FB50-FDFF   [ARABIC PRESENTATION FORMS A]
FE20-FE2F   [COMBINING HALF MARKS]
FE30-FE4F   [CJK COMPATIBILITY FORMS]
FE50-FE6F   [SMALL FORM VARIANTS]
FE70-FEFC   [ARABIC PRESENTATION FORMS B]
FF00-FFEF   [HALFWIDTH AND FULLWIDTH FORMS]

3.2 prohib-2: Separators

Horizontal and vertical spacing characters would make it unclear where a
host name begins and ends. The prohibited spacing characters are:

0020        SPACE
00A0        NO-BREAK SPACE
1680        OGHAM SPACE MARK
2000-200B   [SPACES]
2028        LINE SEPARATOR
2029        PARAGRAPH SEPARATOR
202F        NARROW NO-BREAK SPACE
3000        IDEOGRAPHIC SPACE

Allowing periods and period-like characters as characters within a name
part would also cause similar confusion. The prohibited periods,
characters that look like periods, and characters that canonicalize to a
period or to a period-like character are:

002E        FULL STOP
06D4        ARABIC FULL STOP
2024        ONE DOT LEADER
2025        TWO DOT LEADER
2026        HORIZONTAL ELLIPSIS
2488        DIGIT ONE FULL STOP
2489        DIGIT TWO FULL STOP
248A        DIGIT THREE FULL STOP
248B        DIGIT FOUR FULL STOP
248C        DIGIT FIVE FULL STOP
248D        DIGIT SIX FULL STOP
248E        DIGIT SEVEN FULL STOP
248F        DIGIT EIGHT FULL STOP
2490        DIGIT NINE FULL STOP
2491        NUMBER TEN FULL STOP
2492        NUMBER ELEVEN FULL STOP
2493        NUMBER TWELVE FULL STOP
2494        NUMBER THIRTEEN FULL STOP
2495        NUMBER FOURTEEN FULL STOP
2496        NUMBER FIFTEEN FULL STOP
2497        NUMBER SIXTEEN FULL STOP
2498        NUMBER SEVENTEEN FULL STOP
2499        NUMBER EIGHTEEN FULL STOP
249A        NUMBER NINETEEN FULL STOP
249B        NUMBER TWENTY FULL STOP
33C2        SQUARE AM
33C2        SQUARE AM
33C7        SQUARE CO
33D8        SQUARE PM
33D8        SQUARE PM

3.3 prohib-3: Non-displaying and non-spacing characters

There are many characters that cannot be seen in the ISO 10646 character
set. These include control characters, non-breaking spaces, formatting
characters, and tagging characters. These characters would certainly
cause confusion if allowed in host names.

0000-001F   [CONTROL CHARACTERS]
007F        DELETE
0080-009F   [CONTROL CHARACTERS]
070F        SYRIAC ABBREVIATION MARK
180B        MONGOLIAN FREE VARIATION SELECTOR ONE
180C        MONGOLIAN FREE VARIATION SELECTOR TWO
180D        MONGOLIAN FREE VARIATION SELECTOR THREE
180E        MONGOLIAN VOWEL SEPARATOR
200C        ZERO WIDTH NON-JOINER
200D        ZERO WIDTH JOINER
200E        LEFT-TO-RIGHT MARK
200F        RIGHT-TO-LEFT MARK
202A        LEFT-TO-RIGHT EMBEDDING
202B        RIGHT-TO-LEFT EMBEDDING
202C        POP DIRECTIONAL FORMATTING
202D        LEFT-TO-RIGHT OVERRIDE
202E        RIGHT-TO-LEFT OVERRIDE
206A        INHIBIT SYMMETRIC SWAPPING
206B        ACTIVATE SYMMETRIC SWAPPING
206C        INHIBIT ARABIC FORM SHAPING
206D        ACTIVATE ARABIC FORM SHAPING
206E        NATIONAL DIGIT SHAPES
206F        NOMINAL DIGIT SHAPES
FEFF        ZERO WIDTH NO-BREAK SPACE
FFF9        INTERLINEAR ANNOTATION ANCHOR
FFFA        INTERLINEAR ANNOTATION SEPARATOR
FFFB        INTERLINEAR ANNOTATION TERMINATOR
FFFC        OBJECT REPLACEMENT CHARACTER
FFFD        REPLACEMENT CHARACTER

3.4 prohib-4: Private use characters

Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are:

E000-F8FF   [PRIVATE USE, PLANE 0]

3.5 prohib-5: Punctuation

The following characters are reserved or delimiters in URLs [RFC2396]
and [RFC2732]:

" # $ % & + , . / : ; < = > ? @ [ ]

3.5.1 Characters from URLs

The following punctuation characters are prohibited because they are
reserved or delimiters in URLs.

0022        QUOTATION MARK
0023        NUMBER SIGN
0024        DOLLAR SIGN
0025        PERCENT SIGN
0026        AMPERSAND
002B        PLUS SIGN
002C        COMMA
002E        FULL STOP
002F        SOLIDUS
003A        COLON
003B        SEMICOLON
003C        LESS-THAN SIGN
003D        EQUALS SIGN
003E        GREATER-THAN SIGN
003F        QUESTION MARK
0040        COMMERCIAL AT
005B        LEFT SQUARE BRACKET
005D        RIGHT SQUARE BRACKET

3.5.2 Characters that canonicalize to characters from URLs

The following punctuation characters are prohibited because their
normalization contains one or more of the characters from section 3.5.1.

037E        GREEK QUESTION MARK
2048        QUESTION EXCLAMATION MARK
2049        EXCLAMATION QUESTION MARK
207A        SUPERSCRIPT PLUS SIGN
207C        SUPERSCRIPT EQUALS SIGN
208A        SUBSCRIPT PLUS SIGN
208C        SUBSCRIPT EQUALS SIGN
2100        ACCOUNT OF
2101        ADDRESSED TO THE SUBJECT
2105        CARE OF
2106        CADA UNA

3.5.3 Characters that look like characters from URLs

The following are prohibited because they look indistinguishable from
the characters listed in section 3.5.1.

037E        GREEK QUESTION MARK
0589        ARMENIAN FULL STOP
060C        ARABIC COMMA
061B        ARABIC SEMICOLON
066A        ARABIC PERCENT SIGN
201A        SINGLE LOW-9 QUOTATION MARK
2030        PER MILLE SIGN
2031        PER TEN THOUSAND SIGN
2033        DOUBLE PRIME
2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK
2044        FRACTION SLASH
203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D        INTERROBANG
3001        IDEOGRAPHIC COMMA
3002        IDEOGRAPHIC FULL STOP
3003        DITTO MARK
3008        LEFT ANGLE BRACKET
3009        RIGHT ANGLE BRACKET
3014        LEFT TORTOISE SHELL BRACKET
3015        RIGHT TORTOISE SHELL BRACKET
301A        LEFT WHITE SQUARE BRACKET
301B        RIGHT WHITE SQUARE BRACKET

3.5.4 Other punctuation

The following punctuation are prohibited because they are unlikely to
be used in names and may be confusing to users or to character-entry
processes:

005C        REVERSE SOLIDUS

3.6 prohib-6: Symbols

[UniData] has non-normative categories for symbols. The four symbol
categories are:

Symbol, Currency: Currency symbols could appear in company names and
spoken phrases, so they are not prohibited.

Symbol, Modifier: Stand-alone modifiers might appear in personal names,
company names, and spoken phrases, so they are not prohibited.

Symbol, Math: It is very unlikely that there are any significant
personal names, company names, or spoken phrases that contain
mathematical symbols. Further, many of these symbols are the same or
similar to other punctuation, thereby leading to ambiguity. For this
reason, math-specific symbols are prohibited. These prohibited math
symbols are:

00AC        NOT SIGN
00B1        PLUS-MINUS SIGN
2200-22FF   [MATHEMATICAL OPERATORS]

Further, the following characters canonicalize to characters in the
above math list, and therefore are also prohibited:

00BC        VULGAR FRACTION ONE QUARTER
00BD        VULGAR FRACTION ONE HALF
00BE        VULGAR FRACTION THREE QUARTERS
207B        SUPERSCRIPT MINUS
208B        SUBSCRIPT MINUS
2153        VULGAR FRACTION ONE THIRD
2154        VULGAR FRACTION TWO THIRDS
2155        VULGAR FRACTION ONE FIFTH
2156        VULGAR FRACTION TWO FIFTHS
2157        VULGAR FRACTION THREE FIFTHS
2158        VULGAR FRACTION FOUR FIFTHS
2159        VULGAR FRACTION ONE SIXTH
215A        VULGAR FRACTION FIVE SIXTHS
215B        VULGAR FRACTION ONE EIGHTH
215C        VULGAR FRACTION THREE EIGHTHS
215D        VULGAR FRACTION FIVE EIGHTHS
215E        VULGAR FRACTION SEVEN EIGHTHS
215F        FRACTION NUMERATOR ONE
33A7        SQUARE M OVER S
33A8        SQUARE M OVER S SQUARED
33AE        SQUARE RAD OVER S
33AF        SQUARE RAD OVER S SQUARED
33C6        SQUARE C OVER KG

Symbol, Other: This category covers a multitude of symbols, few of which
would ever appear in personal names, company names, and spoken phrases.
The rest of the prohibited symbols are:

2190-21FF   [ARROWS]
2300-23FF   [MISCELLANEOUS TECHNICAL]
2400-243F   [CONTROL PICTURES]
2440-245F   [OPTICAL CHARACTER RECOGNITION]
2500-257F   [BOX DRAWING]
2580-259F   [BLOCK ELEMENTS]
25A0-25FF   [GEOMETRIC SHAPES]
2600-267F   [MISCELLANEOUS SYMBOLS]
2700-27BF   [DINGBATS]
2800-287F   [BRAILLE PATTERNS]

3.7 Additional prohibited characters

3.7.1 Unassigned characters

All characters not yet assigned in [ISO10646] are prohibited. Although
this may at first seem trivial, it is extremely important because
characters that may be assigned in the future might have properties that
would cause them to be prohibited or might have case-folding properties.
As is the case of all prohibited characters, if a DNS server receives a
request containing an unassigned character, then the IDN protocol MUST
return an error message.

3.7.2 Surrogate characters

So far, all proposals for binary encodings of internationalized name
parts have specified UTF-8 as the encoding format. In such an encoding,
surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
the following are prohibited:

D800-DFFF   [SURROGATE CHARACTERS]

3.7.3 Uppercase characters with no lowercase mappings

There are many uppercase characters in [ISO10646] which do not have
lowercase equivalents in [UniData]. Therefore, they are prohibited on
input because they would get through the case mapping step while still
being in uppercase.

The characters that are prohibited on input because they are uppercase
but have no lowercase mappings are:

03D2        GREEK UPSILON WITH HOOK SYMBOL
03D3        GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4        GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0        CYRILLIC LETTER PALOCHKA
10A0-10C5   [GEORGIAN CAPITAL LETTERS]

Note that many characters in the range U+1200 to U+213A, the letterlike
symbols, also are uppercase but have no lowercase mappings. However,
they are not listed here because the entire range is already prohibited
in section 3.6.

3.7.4 Radicals and Ideographic Description

Some Han characters can be informally defined in terms of ideographic
descriptions. However, ideographic descriptions can lead to multiple
character streams leading to the same character in a fashion that does
not canonicalize. Thus, the radicals for ideographic description and the
ideographic description characters themselves are prohibited. These
characters are:

2E80-2EFF   [CJK RADICALS SUPPLEMENT]
2F00-2FDF   [KANGXI RADICALS]
2FF0-2FFF   [IDEOGRAPHIC DESCRIPTION CHARACTERS]

3.8 Summary of prohibited characters

The following is a collected list from the previous sections.

0000-001F   [CONTROL CHARACTERS]
0020        SPACE
0022        QUOTATION MARK
0023        NUMBER SIGN
0024        DOLLAR SIGN
0025        PERCENT SIGN
0026        AMPERSAND
002B        PLUS SIGN
002C        COMMA
002E        FULL STOP
002E        FULL STOP
002F        SOLIDUS
003A        COLON
003B        SEMICOLON
003C        LESS-THAN SIGN
003D        EQUALS SIGN
003E        GREATER-THAN SIGN
003F        QUESTION MARK
0040        COMMERCIAL AT
005B        LEFT SQUARE BRACKET
005C        REVERSE SOLIDUS
005D        RIGHT SQUARE BRACKET
007F        DELETE
0080-009F   [CONTROL CHARACTERS]
00A0        NO-BREAK SPACE
00AC        NOT SIGN
00AD        SOFT HYPHEN
00B1        PLUS-MINUS SIGN
00BC        VULGAR FRACTION ONE QUARTER
00BD        VULGAR FRACTION ONE HALF
00BE        VULGAR FRACTION THREE QUARTERS
00D7        MULTIPLICATION SIGN
01C3        LATIN LETTER RETROFLEX CLICK
02B0-02FF   [SPACING MODIFIER LETTERS]
037E        GREEK QUESTION MARK
037E        GREEK QUESTION MARK
03D2        GREEK UPSILON WITH HOOK SYMBOL
03D3        GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4        GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0        CYRILLIC LETTER PALOCHKA
0589        ARMENIAN FULL STOP
060C        ARABIC COMMA
061B        ARABIC SEMICOLON
066A        ARABIC PERCENT SIGN
066D        ARABIC FIVE POINTED STAR
06D4        ARABIC FULL STOP
070F        SYRIAC ABBREVIATION MARK
10A0-10C5   [GEORGIAN CAPITAL LETTERS]
1680        OGHAM SPACE MARK
1806        MONGOLIAN TODO SOFT HYPHEN
180B        MONGOLIAN FREE VARIATION SELECTOR ONE
180C        MONGOLIAN FREE VARIATION SELECTOR TWO
180D        MONGOLIAN FREE VARIATION SELECTOR THREE
180E        MONGOLIAN VOWEL SEPARATOR
2000-200B   [SPACES]
200C        ZERO WIDTH NON-JOINER
200D        ZERO WIDTH JOINER
200E        LEFT-TO-RIGHT MARK
200F        RIGHT-TO-LEFT MARK
2010        HYPHEN
2011        NON-BREAKING HYPHEN
2012        FIGURE DASH
2013        EN DASH
2014        EM DASH
201A        SINGLE LOW-9 QUOTATION MARK
2024        ONE DOT LEADER
2025        TWO DOT LEADER
2026        HORIZONTAL ELLIPSIS
2028        LINE SEPARATOR
2029        PARAGRAPH SEPARATOR
202A        LEFT-TO-RIGHT EMBEDDING
202B        RIGHT-TO-LEFT EMBEDDING
202C        POP DIRECTIONAL FORMATTING
202D        LEFT-TO-RIGHT OVERRIDE
202E        RIGHT-TO-LEFT OVERRIDE
202F        NARROW NO-BREAK SPACE
2030        PER MILLE SIGN
2031        PER TEN THOUSAND SIGN
2033        DOUBLE PRIME
2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D        INTERROBANG
2044        FRACTION SLASH
2048        QUESTION EXCLAMATION MARK
2049        EXCLAMATION QUESTION MARK
206A        INHIBIT SYMMETRIC SWAPPING
206B        ACTIVATE SYMMETRIC SWAPPING
206C        INHIBIT ARABIC FORM SHAPING
206D        ACTIVATE ARABIC FORM SHAPING
206E        NATIONAL DIGIT SHAPES
206F        NOMINAL DIGIT SHAPES
207A        SUPERSCRIPT PLUS SIGN
207B        SUPERSCRIPT MINUS
207C        SUPERSCRIPT EQUALS SIGN
208A        SUBSCRIPT PLUS SIGN
208B        SUBSCRIPT MINUS
208C        SUBSCRIPT EQUALS SIGN
2100        ACCOUNT OF
2101        ADDRESSED TO THE SUBJECT
2105        CARE OF
2106        CADA UNA
2153        VULGAR FRACTION ONE THIRD
2154        VULGAR FRACTION TWO THIRDS
2155        VULGAR FRACTION ONE FIFTH
2156        VULGAR FRACTION TWO FIFTHS
2157        VULGAR FRACTION THREE FIFTHS
2158        VULGAR FRACTION FOUR FIFTHS
2159        VULGAR FRACTION ONE SIXTH
215A        VULGAR FRACTION FIVE SIXTHS
215B        VULGAR FRACTION ONE EIGHTH
215C        VULGAR FRACTION THREE EIGHTHS
215D        VULGAR FRACTION FIVE EIGHTHS
215E        VULGAR FRACTION SEVEN EIGHTHS
215F        FRACTION NUMERATOR ONE
2160-217F   [ROMAN NUMERALS]
2190-21FF   [ARROWS]
2200-22FF   [MATHEMATICAL OPERATORS]
2300-23FF   [MISCELLANEOUS TECHNICAL]
2400-243F   [CONTROL PICTURES]
2440-245F   [OPTICAL CHARACTER RECOGNITION]
2488        DIGIT ONE FULL STOP
2489        DIGIT TWO FULL STOP
248A        DIGIT THREE FULL STOP
248B        DIGIT FOUR FULL STOP
248C        DIGIT FIVE FULL STOP
248D        DIGIT SIX FULL STOP
248E        DIGIT SEVEN FULL STOP
248F        DIGIT EIGHT FULL STOP
2490        DIGIT NINE FULL STOP
2491        NUMBER TEN FULL STOP
2492        NUMBER ELEVEN FULL STOP
2493        NUMBER TWELVE FULL STOP
2494        NUMBER THIRTEEN FULL STOP
2495        NUMBER FOURTEEN FULL STOP
2496        NUMBER FIFTEEN FULL STOP
2497        NUMBER SIXTEEN FULL STOP
2498        NUMBER SEVENTEEN FULL STOP
2499        NUMBER EIGHTEEN FULL STOP
249A        NUMBER NINETEEN FULL STOP
249B        NUMBER TWENTY FULL STOP
2500-257F   [BOX DRAWING]
2580-259F   [BLOCK ELEMENTS]
25A0-25FF   [GEOMETRIC SHAPES]
2600-267F   [MISCELLANEOUS SYMBOLS]
2700-27BF   [DINGBATS]
2800-287F   [BRAILLE PATTERNS]
2E80-2EFF   [CJK RADICALS SUPPLEMENT]
2F00-2FDF   [KANGXI RADICALS]
2FF0-2FFF   [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3000        IDEOGRAPHIC SPACE
3001        IDEOGRAPHIC COMMA
3002        IDEOGRAPHIC FULL STOP
3003        DITTO MARK
3008        LEFT ANGLE BRACKET
3009        RIGHT ANGLE BRACKET
33A7        SQUARE M OVER S
33A8        SQUARE M OVER S SQUARED
33AE        SQUARE RAD OVER S
33AF        SQUARE RAD OVER S SQUARED
33C2        SQUARE AM
33C2        SQUARE AM
33C6        SQUARE C OVER KG
33C7        SQUARE CO
33D8        SQUARE PM
33D8        SQUARE PM
D800-DFFF   [SURROGATE CHARACTERS]
E000-F8FF   [PRIVATE USE, PLANE 0]
FB1D-FB4F   [HEBREW PRESENTATION FORMS]
FB50-FDFF   [ARABIC PRESENTATION FORMS A]
FE20-FE2F   [COMBINING HALF MARKS]
FE30-FE4F   [CJK COMPATIBILITY FORMS]
FE50-FE6F   [SMALL FORM VARIANTS]
FE70-FEFC   [ARABIC PRESENTATION FORMS B]
FEFF        ZERO WIDTH NO-BREAK SPACE
FF00-FFEF   [HALFWIDTH AND FULLWIDTH FORMS]
FFF9        INTERLINEAR ANNOTATION ANCHOR
FFFA        INTERLINEAR ANNOTATION SEPARATOR
FFFB        INTERLINEAR ANNOTATION TERMINATOR
FFFC        OBJECT REPLACEMENT CHARACTER
FFFD        REPLACEMENT CHARACTER
Unassigned characters


4. Case Folding

After it has been verified that the input text has none of the
characters prohibited for case folding, the case-folding step itself is
quite straight-forward. For each character in the input, if there is a
lowercase mapping for that character in [UniData], the input character
is changed to the mapped lowercase letter.


5. Canonicalization

After case folding, the input string is normalized using form KC, as
described in [UTR15].

6. IDN Table Revisions

A table consisting of all characters allowed and prohibited and the
rules for case folding and canonicalization will be created based on the
content of the [UniData] and on the content of this document. This table
will be the authority for implementations to follow and will be
normatively referenced by this document. Such a table will enable the
IDN protocol to have versions independent of the revisions to Unicode
and/or to ISO 10646 because the revision of IDN and its deployment may
not in sync with revisions to Unicode and ISO 10646.

In a future draft of this document, IANA will be asked to keep this
table, with an initial version number of 1. Each new version of the
table will have a new, higher version number.


7. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any change
to the characteristics of the DNS can change the security of much of the
Internet.

Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host name.


8. References

[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.

[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement.

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane.  Five amendments and a
technical corrigendum have been published up to now. UTF-16 is described
in Annex Q, published as Amendment 1. 17 other amendments are currently
at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE
UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

[Normalize] Character Normalization in IETF Protocols,
draft-duerst-i18n-norm-03

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[RFC2396] Tim Berners-Lee, et. al., "Uniform Resource Identifiers (URI):
Generic Syntax", August 1998, RFC 2396.

[RFC2732] Robert Hinden, et. al., Format for Literal IPv6 Addresses in
URL's, December 1999, RFC 2732.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).

[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

[UniData] The Unicode Consortium. UnicodeData File.
<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms.
Unicode Technical Report #15.
<http://www.unicode.org/unicode/reports/tr15/>.


A. Acknowledgements

Many people from the IETF IDN Working Group and the Unicode Technical
Committee contributed ideas that went into the first draft of this
document. Mark Davis was particularly helpful in some of the early
ideas.


B. Changes From Previous Versions of this Draft

This is the -00 version, so there are no changes.


C. IANA Considerations

There are no specific IANA considerations in this draft, but there will
be in a future draft of this document.


D. Author Contact Information

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA  95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org

Marc Blanchet
Viagenie inc.
2875 boul. Laurier, bur. 300
Ste-Foy, Quebec, Canada, G1V 2M2
Marc.Blanchet@viagenie.qc.ca