Internet Engineering Task Force (IETF)                H. Alvestrand, Ed.
Request for Comments: 5893                                        Google
Category: Standards Track                                        C. Karp
ISSN: 2070-1721                        Swedish Museum of Natural History
                                                             August 2010


                       Right-to-Left Scripts for
         Internationalized Domain Names for Applications (IDNA)

Abstract

   The use of right-to-left scripts in Internationalized Domain Names
   (IDNs) has presented several challenges.  This memo provides a new
   Bidi rule for Internationalized Domain Names for Applications (IDNA)
   labels, based on the encountered problems with some scripts and some
   shortcomings in the 2003 IDNA Bidi criterion.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5893.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Alvestrand & Karp            Standards Track                    [Page 1]

RFC 5893                   IDNA Right to Left                August 2010

Table of Contents

   1.  Introduction
     1.1.  Purpose and Applicability
     1.2.  Background and History
     1.3.  Structure of the Rest of This Document
     1.4.  Terminology
   2.  The Bidi Rule
   3.  The Requirement Set for the Bidi Rule
   4.  Examples of Issues Found with RFC 3454
     4.1.  Dhivehi
     4.2.  Yiddish
     4.3.  Strings with Numbers
   5.  Troublesome Situations and Guidelines
   6.  Other Issues in Need of Resolution
   7.  Compatibility Considerations
     7.1.  Backwards Compatibility Considerations
     7.2.  Forward Compatibility Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. References
     10.1. Normative References
     10.2. Informative References

1.  Introduction

1.1.  Purpose and Applicability

   The purpose of this document is to establish a rule that can be
   applied to Internationalized Domain Name (IDN) labels in Unicode form
   (U-labels) containing characters from scripts that are written from
   right to left.  It is part of the revised IDNA protocol [RFC5891].

   When labels satisfy the rule, and when certain other conditions are
   satisfied, there is only a minimal chance of these labels being
   displayed in a confusing way by the Unicode bidirectional display
   algorithm.

   The other normative documents in the IDNA2008 document set establish
   criteria for valid labels, including listing the permitted
   characters.  This document establishes additional validity criteria
   for labels in scripts normally written from right to left.

   This specification is not intended to place any requirements on
   domain names that do not contain characters from such scripts.

1.2.  Background and History

   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
   following statement in its Section 6 on the Bidi algorithm:

      3) If a string contains any RandALCat character, a RandALCat
      character MUST be the first character of the string, and a
      RandALCat character MUST be the last character of the string.

   (A RandALCat character is a character with unambiguously
   right-to-left directionality.)

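   As an illustration only, the quoted statement can be sketched in
   Python.  The sketch assumes that "RandALCat" can be approximated by
   the Bidi classes R and AL as reported by the interpreter's
   `unicodedata` module (RFC 3454 itself uses fixed tables based on an
   older Unicode version), and the function names are hypothetical.

```python
import unicodedata


def is_randalcat(ch):
    """Approximate RandALCat as Bidi class R or AL (an assumption)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def passes_rfc3454_statement_3(s):
    """Check only the statement quoted above, not all of Stringprep."""
    if not s or not any(is_randalcat(ch) for ch in s):
        return True  # the statement only constrains strings with RandALCat
    return is_randalcat(s[0]) and is_randalcat(s[-1])


hebrew_word = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters, all class R
print(passes_rfc3454_statement_3(hebrew_word))             # True
print(passes_rfc3454_statement_3(hebrew_word + "\u05b8"))  # False: ends in NSM
print(passes_rfc3454_statement_3("example"))               # True: no RandALCat
```

   The second call shows the kind of rejection discussed in the rest of
   this section: a trailing combining mark is not a RandALCat character.
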
   The reasoning behind this prohibition was to ensure that every
   component of a displayed domain name has an unambiguously preferred
   direction.  However, this made certain words in languages written
   with right-to-left scripts invalid as IDN labels, and in at least one
   case (Dhivehi) meant that all the words of an entire language were
   forbidden as IDN labels.

   This is illustrated below with examples taken from the Dhivehi and
   Yiddish languages, as written with the Thaana and Hebrew scripts,
   respectively.

   RFC 3454 did not explicitly state the requirement to be fulfilled.
   Therefore, it is impossible to determine whether a simple relaxation
   of the rule would continue to fulfill the requirement.

   While this document specifies rules quite different from those in
   RFC 3454, most reasonable labels that were allowed under RFC 3454
   will also be allowed under this specification.  The most important
   examples of labels that are no longer permitted are labels that mix
   Arabic and European digits (AN and EN) inside an RTL label, and
   labels that use AN in an LTR label (see Section 1.4 for terminology).
   The operational impact of using the new rule in the updated IDNA
   specification is therefore limited.

1.3.  Structure of the Rest of This Document

   Section 2 defines a rule, the "Bidi rule", which can be used on a
   domain name label to check how safe it is to use in a domain name of
   possibly mixed directionality.  The primary initial use of this rule
   is as part of the IDNA2008 protocol [RFC5891].

   Section 3 sets out the requirements for defining the Bidi rule.

   Section 4 gives detailed examples that serve as justification for the
   new rule.

   Section 5 to Section 8 describe various situations that can occur
   when dealing with domain names with characters of different
   directionality.

   Only Section 1.4 and Section 2 are normative.

1.4.  Terminology

   The terminology used to describe IDNA concepts is defined in the
   Definitions document [RFC5890].

   The terminology used for the Bidi properties of Unicode characters is
   taken from the Unicode Standard [Unicode52].

   The Unicode Standard specifies a Bidi property for each character.
   That property controls the character's behavior in the Unicode
   bidirectional algorithm [Unicode-UAX9].  For reference, here are the
   values that the Unicode Bidi property can have:

   o  L - Left to right - most letters in LTR scripts

   o  R - Right to left - most letters in non-Arabic RTL scripts

   o  AL - Arabic letters - most letters in the Arabic script

   o  EN - European Number (0-9, and Extended Arabic-Indic numbers)

   o  ES - European Number Separator (+ and -)

   o  ET - European Number Terminator (currency symbols, the hash sign,
      the percent sign and so on)

   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
      not the Extended Arabic-Indic numbers

   o  CS - Common Number Separator (. , / : et al)

   o  NSM - Nonspacing Mark - most combining accents

   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)

   o  B - Paragraph Separator

   o  S - Segment Separator

   o  WS - Whitespace, including the SPACE character

   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
      characters" and are not used in IDNA labels.

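   The Bidi property of a character can be looked up programmatically.
   The following sketch uses Python's standard `unicodedata` module; the
   sample characters are chosen here for illustration, and the reported
   values depend on the Unicode version bundled with the interpreter,
   which may differ from [Unicode52].

```python
import unicodedata

# One sample character for most of the classes listed above.
SAMPLES = {
    "a": "L",        # LATIN SMALL LETTER A
    "\u05d0": "R",   # HEBREW LETTER ALEF
    "\u0627": "AL",  # ARABIC LETTER ALEF
    "5": "EN",       # DIGIT FIVE
    "\u06f5": "EN",  # EXTENDED ARABIC-INDIC DIGIT FIVE
    "\u0665": "AN",  # ARABIC-INDIC DIGIT FIVE
    "-": "ES",       # HYPHEN-MINUS
    "%": "ET",       # PERCENT SIGN
    ".": "CS",       # FULL STOP
    "\u0301": "NSM", # COMBINING ACUTE ACCENT
    "\u200d": "BN",  # ZERO WIDTH JOINER
    " ": "WS",       # SPACE
    "@": "ON",       # COMMERCIAL AT
}


def bidi_class(ch):
    """Return the Bidi class abbreviation for a single character."""
    return unicodedata.bidirectional(ch)


for ch, expected in SAMPLES.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{bidi_class(ch)} (listed above as {expected})")
```
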
   In this memo, we use "network order" to describe the sequence of
   characters as transmitted on the wire or stored in a file; the terms
   "first", "next", "previous", "beginning", "end", "before", and
   "after" are used to refer to the relationship of characters and
   labels in network order.

   We use "display order" to talk about the sequence of characters as
   imaged on a display medium; the terms "left" and "right" are used to
   refer to the relationship of characters and labels in display order.

   Most of the time, the examples use the abbreviations for the Unicode
   Bidi classes to denote the directionality of the characters; the
   example string CS L consists of one character of class CS and one
   character of class L.  In some examples, the convention that
   uppercase characters are of class R or AL, and lowercase characters
   are of class L is used -- thus, the example string ABC.abc would
   consist of three right-to-left characters and three left-to-right
   characters.

   The directionality of such examples is determined by context -- for
   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
   first example string is in network order, the second example string
   is in display order.

   The term "paragraph" is used in the sense of the Unicode Bidi
   specification [Unicode-UAX9].  It means "a block of text that has an
   overall direction, either left to right or right to left",
   approximately; see the "Unicode Bidirectional Algorithm"
   [Unicode-UAX9] for details.

   "RTL" and "LTR" are abbreviations for "right to left" and "left to
   right", respectively.

   An RTL label is a label that contains at least one character of type
   R, AL, or AN.

   An LTR label is any label that is not an RTL label.

   A "Bidi domain name" is a domain name that contains at least one RTL
   label.  (Note: This definition includes domain names containing only
   dots and right-to-left characters.  Providing a separate category of
   "RTL domain names" would not make this specification simpler, so it
   has not been done.)

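   The definitions of "RTL label" and "Bidi domain name" translate
   directly into code.  The following is a minimal sketch, assuming
   that the input is a domain name in U-label form whose labels are
   separated by FULL STOP; the function names are hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label):
    """An RTL label contains at least one character of type R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_bidi_domain_name(name):
    """A Bidi domain name contains at least one RTL label."""
    return any(is_rtl_label(label) for label in name.split("."))


print(is_rtl_label("example"))               # False: an LTR label
print(is_rtl_label("a\u0661"))               # True: contains an AN digit
print(is_bidi_domain_name("\u05d0\u05d1.com"))  # True: first label is RTL
```
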
2.  The Bidi Rule

   The following rule, consisting of six conditions, applies to labels
   in Bidi domain names.  The requirements that this rule satisfies are
   described in Section 3.  All of the conditions must be satisfied for
   the rule to be satisfied.

   1.  The first character must be a character with Bidi property L, R,
       or AL.  If it has the R or AL property, it is an RTL label; if it
       has the L property, it is an LTR label.

   2.  In an RTL label, only characters with the Bidi properties R, AL,
       AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.

   3.  In an RTL label, the end of the label must be a character with
       Bidi property R, AL, EN, or AN, followed by zero or more
       characters with Bidi property NSM.

   4.  In an RTL label, if an EN is present, no AN may be present, and
       vice versa.

   5.  In an LTR label, only characters with the Bidi properties L, EN,
       ES, CS, ET, ON, BN, or NSM are allowed.

   6.  In an LTR label, the end of the label must be a character with
       Bidi property L or EN, followed by zero or more characters with
       Bidi property NSM.

   The following guarantees can be made based on the above:

   o  In a domain name consisting of only labels that satisfy the rule,
      the requirements of Section 3 are satisfied.  Note that even LTR
      labels and pure ASCII labels have to be tested.

   o  In a domain name consisting of only LDH labels (as defined in the
      Definitions document [RFC5890]) and labels that satisfy the rule,
      the requirements of Section 3 are satisfied as long as a label
      that starts with an ASCII digit does not come after a
      right-to-left label.

   No guarantee is given for other combinations.

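   The six conditions can be sketched as a single Python function.
   This is an illustration rather than a reference implementation: the
   function name is hypothetical, the Bidi classes come from the
   interpreter's `unicodedata` tables rather than from [Unicode52], and
   treating an empty label as failing condition 1 is an assumption of
   the sketch.  The other IDNA2008 validity checks are not covered.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label):
    """Apply conditions 1-6 of the Bidi rule to one U-label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False  # assumption: no first character, so condition 1 fails

    # Condition 1: the first character decides RTL versus LTR.
    if classes[0] not in ("L", "R", "AL"):
        return False
    rtl = classes[0] in ("R", "AL")

    # Conditions 2 and 5: permitted classes for the whole label.
    allowed = RTL_ALLOWED if rtl else LTR_ALLOWED
    if any(c not in allowed for c in classes):
        return False

    # Conditions 3 and 6: look at the last character before any
    # trailing NSM characters.  The first character is never NSM here,
    # so at least one character remains.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    final_allowed = ("R", "AL", "EN", "AN") if rtl else ("L", "EN")
    if classes[end - 1] not in final_allowed:
        return False

    # Condition 4: EN and AN must not both occur in an RTL label.
    if rtl and "EN" in classes and "AN" in classes:
        return False
    return True


print(satisfies_bidi_rule("example"))          # True
print(satisfies_bidi_rule("\u05d0\u05d15"))    # True: RTL label ending in EN
print(satisfies_bidi_rule("5\u05d0"))          # False: first character is EN
print(satisfies_bidi_rule("\u0627\u06611"))    # False: AN and EN are mixed
```

   Note that condition 1 classifies a label by its first character, so
   a label that starts with L but contains an R, AL, or AN character is
   tested as an LTR label and rejected by condition 5.
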
3.  The Requirement Set for the Bidi Rule

   This document, unlike RFC 3454 [RFC3454], provides an explicit
   justification for the Bidi rule, and states a set of requirements for
   which it is possible to test whether or not the modified rule
   fulfills the requirement.

   All the text in this document assumes that text containing the labels
   under consideration will be displayed using the Unicode bidirectional
   algorithm [Unicode-UAX9].

   The requirements proposed are these:

   o  Label Uniqueness: No two labels, when presented in display order
      in the same paragraph, should have the same sequence of characters
      without also having the same sequence of characters in network
      order, both when the paragraph has LTR direction and when the
      paragraph has RTL direction.  (This is the criterion that is
      explicit in RFC 3454).  (Note that a label displayed in an RTL
      paragraph may display the same as a different label displayed in
      an LTR paragraph and still satisfy this criterion.)

   o  Character Grouping: When displaying a string of labels, using the
      Unicode Bidi algorithm to reorder the characters for display, the
      characters of each label should remain grouped between the
      characters delimiting the labels, both when the string is embedded
      in a paragraph with LTR direction and when it is embedded in a
      paragraph with RTL direction.

   Several stronger statements were considered and rejected, because
   they seem to be impossible to fulfill within the constraints of the
   Unicode bidirectional algorithm.  These include:

   o  The appearance of a label should be unaffected by its embedding
      context.  This proved impossible even for ASCII labels; the label
      "123-A" will have a different display order in an RTL context than
      in an LTR context.  (This particular example is, however,
      disallowed anyway.)

   o  The sequence of labels should be consistent with network order.
      This proved impossible -- a domain name consisting of the labels
      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
      an LTR context.  (In an RTL context, it will be displayed as
      L4.R3.R2.L1).

   o  No two domain names should be displayed the same, even under
      differing directionality.  This was shown to be unsound, since the
      domain name (in network order) ABC.abc will have display order
      CBA.abc in an LTR context and abc.CBA in an RTL context, while the
      domain name (network) abc.ABC will have display order abc.CBA in
      an LTR context and CBA.abc in an RTL context.

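   The ABC.abc and abc.ABC display orders in the last rejected
   statement can be reproduced with a deliberately small model of the
   Bidi algorithm.  The sketch below is not an implementation of
   [Unicode-UAX9]: it assumes the memo's convention that uppercase
   letters are class R and lowercase letters are class L, treats every
   other character as a neutral, and does not model digits, nonspacing
   marks, or directional controls.  The function name is hypothetical.

```python
def toy_display_order(text, paragraph_rtl):
    """Return text in display order (left to right) under the toy model."""
    base = "R" if paragraph_rtl else "L"

    def strong(ch):
        if ch.isupper():
            return "R"
        if ch.islower():
            return "L"
        return None  # treated as a neutral

    kinds = [strong(ch) for ch in text]

    # Neutrals take the direction of the surrounding strong characters
    # when both sides agree, and the paragraph direction otherwise.
    resolved = []
    for i, kind in enumerate(kinds):
        if kind is None:
            before = next((k for k in reversed(kinds[:i]) if k), base)
            after = next((k for k in kinds[i + 1:] if k), base)
            kind = before if before == after else base
        resolved.append(kind)

    # Embedding levels: LTR paragraph uses 0 for L and 1 for R;
    # RTL paragraph uses 1 for R and 2 for L.
    if paragraph_rtl:
        levels = [1 if k == "R" else 2 for k in resolved]
    else:
        levels = [0 if k == "L" else 1 for k in resolved]

    # Reverse runs at each level, from the highest level down to 1.
    cells = list(zip(levels, text))
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(cells):
            if cells[i][0] >= level:
                j = i
                while j < len(cells) and cells[j][0] >= level:
                    j += 1
                cells[i:j] = reversed(cells[i:j])
                i = j
            else:
                i += 1
    return "".join(ch for _, ch in cells)


print(toy_display_order("ABC.abc", paragraph_rtl=False))  # CBA.abc
print(toy_display_order("ABC.abc", paragraph_rtl=True))   # abc.CBA
print(toy_display_order("abc.ABC", paragraph_rtl=False))  # abc.CBA
print(toy_display_order("abc.ABC", paragraph_rtl=True))   # CBA.abc
```

   Two different names in network order produce the same display order
   once the paragraph direction differs, which is why that statement
   could not be made a requirement.
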
   One possible requirement was thought to be problematic, but turned
   out to be satisfied by a string that obeys the proposed rules:

   o  The Character Grouping requirement should be satisfied when
      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
      same paragraph (outside of the labels).  Because these controls
      affect presentation order in non-obvious ways, by affecting the
      "sor" and "eor" properties of the Unicode Bidi algorithm, the
      conditions above require extra testing in order to figure out
      whether or not they influence the display of the domain name.
      Testing found that for the strings allowed under the rule
      presented in this document, directional controls do not influence
      the display of the domain name.

   This is still not stated as a requirement, since it did not seem as
   important as the stated requirements, but it is useful to know that
   Bidi domain names where the labels satisfy the rule have this
   property.

   In the following descriptions, first-level bullets are used to
   indicate rules or normative statements; second-level bullets are
   commentary.

   The Character Grouping requirement can be more formally stated as:

   o  Let "Delimiterchars" be a set of characters with the Unicode Bidi
      properties CS, WS, ON.  (These are commonly used to delimit labels
      -- both the FULL STOP and the space are included.  They are not
      allowed in domain labels.)

      *  ET, though it commonly occurs next to domain names in practice,
         is problematic: the context R CS L EN ET (for instance A.a1%)
         makes the label L EN not satisfy the character grouping
         requirement.

      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
         used as a delimiter (for instance, the plus sign).  It is left
         out here.

   o  Let "unproblematic label" be a label that either satisfies the
      requirements or does not contain any character with the Bidi
      properties R, AL, or AN and does not begin with a character with
      the Bidi property EN.  (Informally, "it does not start with a
      number".)

   A label X satisfies the Character Grouping requirement when, for any
   Delimiterchars characters D1 and D2, and for any S1 and S2 that are
   each either an unproblematic label or an empty string, the following
   holds true:

   If the string formed by concatenating S1, D1, X, D2, and S2 is
   reordered according to the Bidi algorithm, then all the characters of
   X in the reordered string are between D1 and D2, and no other
   characters are between D1 and D2, both if the overall paragraph
   direction is LTR and if the overall paragraph direction is RTL.

   Note that the definition is self-referential, since S1 and S2 are
   constrained to be "legal" by this definition.  This makes testing
   changes to proposed rules a little complex, but does not create
   problems for testing whether or not a given proposed rule satisfies
   the criterion.

   The "zero-length" case (S1 or S2 being an empty string) represents
   the case where a domain name is next to something that isn't a domain
   name, separated by a delimiter character.

   Note about the position of BN: The Unicode bidirectional algorithm
   specifies that a BN has an effect on the adjoining characters in
   network order, not in display order, and is therefore treated as if
   removed during Bidi processing ([Unicode-UAX9], Section 3.3.2, rule
   X9 and Section 5.3).  Therefore, the question of "what position does
   a BN have after reordering" is not meaningful.  It has been ignored
   while developing the rules here.

   The Label Uniqueness requirement can be formally stated as:

   If two non-identical labels X and Y, embedded as for the test above,
   displayed in paragraphs with the same directionality, are reordered
   by the Bidi algorithm into the same sequence of code points, the
   labels X and Y cannot both be legal.

4.  Examples of Issues Found with RFC 3454

4.1.  Dhivehi

   Dhivehi, the official language of the Maldives, is written with the
   Thaana script.  This script displays some of the characteristics of
   the Arabic script, including its directional properties, and the
   indication of vowels by the diacritical marking of consonantal base
   characters.  This marking is obligatory, and both two consecutive
   vowels and syllable-final consonants are indicated with unvoiced
   combining marks.  Every Dhivehi word therefore ends with a combining
   mark.

   The word for "computer", which is romanized as "konpeetaru", is
   written with the following sequence of Unicode code points:

      U+0786 THAANA LETTER KAAFU (AL)

      U+07AE THAANA OBOFILI (NSM)

      U+0782 THAANA LETTER NOONU (AL)

      U+07B0 THAANA SUKUN (NSM)

      U+0795 THAANA LETTER PAVIYANI (AL)

      U+07A9 THAANA EEBEEFILI (NSM)

      U+0793 THAANA LETTER TAVIYANI (AL)

      U+07A6 THAANA ABAFILI (NSM)

      U+0783 THAANA LETTER RAA (AL)

      U+07AA THAANA UBUFILI (NSM)

   The directionality class of U+07AA in the Unicode database
   [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
   conformant implementation of the IDNA2003 algorithm will say that
   "this is not in RandALCat" and refuse to encode the string.

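   The effect on this word can be checked with a few lines of Python.
   The sketch below compares only the end-of-label tests: the RFC 3454
   requirement that the last character be RandALCat (approximated here
   as class R or AL) and condition 3 of the Bidi rule in Section 2.
   The helper names are hypothetical, and the classes printed are those
   of the Unicode data bundled with the interpreter.

```python
import unicodedata

# "konpeetaru" in network order, from the code points listed above.
konpeetaru = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"
classes = [unicodedata.bidirectional(ch) for ch in konpeetaru]

for ch, cls in zip(konpeetaru, classes):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({cls})")


def ends_in_randalcat(classes):
    """End test of RFC 3454: the last character must be R or AL."""
    return classes[-1] in ("R", "AL")


def ends_per_condition_3(classes):
    """End test of the Bidi rule: R, AL, EN, or AN, then zero or more NSM."""
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    return end > 0 and classes[end - 1] in ("R", "AL", "EN", "AN")


print(ends_in_randalcat(classes))     # False: the word ends in an NSM
print(ends_per_condition_3(classes))  # True: trailing NSM is tolerated
```
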
4.2.  Yiddish

   Yiddish is one of several languages written with the Hebrew script
   (others include Hebrew and Ladino).  This is basically a consonantal
   alphabet (also termed an "abjad"), but Yiddish is written using an
   extended form that is fully vocalic.  The vowels are indicated in
   several ways, one of which is by repurposing letters that are
   consonants in Hebrew.  Other letters are used both as vowels and
   consonants, with combining marks, called "points", used to
   differentiate between them.  Finally, some base characters can
   indicate several different vowels, which are also disambiguated by
   combining marks.  Pointed characters can appear in word-final
   position and may therefore also be needed at the end of labels.  This
   is not an invariable attribute of a Yiddish string and there is thus
   greater latitude here than there is with Dhivehi.

   The organization now known as the "YIVO Institute for Jewish
   Research" developed orthographic rules for modern Standard Yiddish
   during the 1930s on the basis of work conducted in several venues
   since earlier in that century.  These are given in "The Standardized
   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
   as normatively descriptive of modern Standard Yiddish in any context
   where that notion is deemed relevant.  They have been applied
   exclusively in all formal Yiddish dictionaries published since their
   establishment, and are similarly dominant in academic and
   bibliographic regards.

   It therefore appears appropriate for this repertoire also to be
   supported fully by IDNA.  This presents no difficulty with characters
   in initial and medial positions, but pointed characters are regularly
   used in final position as well.  All of the characters in the SYO
   repertoire appear in both marked and unmarked form with one
   exception: the HEBREW LETTER PE (U+05E4).  The SYO only permits this
   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
   to the Latin letter "f".  There is, however, a separate unpointed
   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
   character when it appears in final position.  The constraint on the
   use of the SYO repertoire resulting from the proscription of
   combining marks at the end of RTL strings thus reduces to nothing
   more, or less, than the equivalent of saying that a string of Latin
   characters cannot end with the letter "p".  It must also be noted
   that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
   characteristic of almost all traditional Yiddish orthographies that
   predate (or remain in use in parallel to) the SYO, being the first
   pointed character to appear in any of them.

   A more general instantiation of the basic problem can be seen in the
   representation of the YIVO acronym.  This acronym is written with the
   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
   QAMATS are combining points.  The Unicode code points are:

      U+05D9 HEBREW LETTER YOD (R)

      U+05D9 HEBREW LETTER YOD (R)

      U+05B4 HEBREW POINT HIRIQ (NSM)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D0 HEBREW LETTER ALEF (R)

      U+05B8 HEBREW POINT QAMATS (NSM)

   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
   database is NSM, which again causes the IDNA2003 algorithm to reject
   the string.

   It may also be noted that all of the combined characters mentioned
   above exist in precomposed form at separate positions in the Unicode
   chart.  However, by invoking Stringprep, the IDNA2003 algorithm also
   rejects those code points, for reasons not discussed here.

4.3.  Strings with Numbers

   By requiring that both the first and the last character of a string
   be members of category R or AL, the Stringprep specification
   [RFC3454] prohibited a string containing right-to-left characters
   from ending with a number.

   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
   ALEF.  Displayed in an LTR context, the first one will be displayed
   from left to right as 5 ALEF (with the 5 being considered right to
   left because of the leading ALEF), while 5 ALEF will be displayed in
   exactly the same order (5 taking the direction from context).
   Clearly, only one of those should be permitted as a registered label,
   but barring them both seems unnecessary.

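   How the two strings fare under each rule can be sketched as follows.
   Only the first-character and end-of-label tests are modelled
   (conditions 1, 3, and 6 of Section 2, and the RFC 3454 statement
   quoted in Section 1.2, with RandALCat approximated as class R or
   AL).  The function names are hypothetical.

```python
import unicodedata


def old_first_last_ok(label):
    """RFC 3454: with any RandALCat present, first and last must be too."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in ("R", "AL") for c in classes):
        return True
    return classes[0] in ("R", "AL") and classes[-1] in ("R", "AL")


def new_first_last_ok(label):
    """Conditions 1, 3, and 6 of the Bidi rule only."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes or classes[0] not in ("L", "R", "AL"):
        return False
    rtl = classes[0] in ("R", "AL")
    while classes[-1] == "NSM":  # first class is never NSM, so this stops
        classes.pop()
    return classes[-1] in (("R", "AL", "EN", "AN") if rtl else ("L", "EN"))


alef_5 = "\u05d05"   # HEBREW LETTER ALEF + DIGIT FIVE
five_alef = "5\u05d0"  # DIGIT FIVE + HEBREW LETTER ALEF

print(old_first_last_ok(alef_5), new_first_last_ok(alef_5))        # False True
print(old_first_last_ok(five_alef), new_first_last_ok(five_alef))  # False False
```

   Under these assumptions the older rule bars both strings, while the
   Bidi rule permits ALEF 5 and bars only 5 ALEF.
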
5.  Troublesome Situations and Guidelines

   There are situations in which labels that satisfy the rule above will
   be displayed in a surprising fashion.  The most important of these is
   the case where a label ending in a character with Bidi property AL,
   AN, or R occurs before a label beginning with a character of Bidi
   property EN.  In that case, the number will appear to move into the
   label containing the right-to-left character, violating the Character
   Grouping requirement.

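   An application that wants to flag this situation could scan adjacent
   labels.  The following is a minimal sketch, assuming labels in
   U-label form separated by FULL STOP; skipping trailing nonspacing
   marks when finding the last character is an assumption of the sketch,
   and the function names are hypothetical.

```python
import unicodedata


def _last_class(label):
    """Bidi class of the last character, skipping trailing NSM."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    while classes and classes[-1] == "NSM":
        classes.pop()
    return classes[-1] if classes else ""


def troublesome_pairs(name):
    """Return (left, right) label pairs where left ends in R, AL, or AN
    and right begins with EN."""
    labels = name.split(".")
    pairs = []
    for left, right in zip(labels, labels[1:]):
        if not left or not right:
            continue  # ignore empty labels such as a trailing root dot
        if (_last_class(left) in ("R", "AL", "AN")
                and unicodedata.bidirectional(right[0]) == "EN"):
            pairs.append((left, right))
    return pairs


print(troublesome_pairs("\u05d0\u05d1.1st.example"))  # one pair is reported
print(troublesome_pairs("example.1st.com"))           # []
```
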
   If the label that occurs after the right-to-left label itself
   satisfies the Bidi criterion, the requirements will be satisfied in
   all cases (this is the reason why the criterion talks about strings
   containing L in some cases).  However, the IDNABIS WG concluded that
   this could not be required for several reasons:

   o  There is a large current deployment of ASCII domain names starting
      with digits.  These cannot possibly be invalidated.

   o  Domain names are often constructed piecemeal, for instance, by
      combining a string with the content of a search list.  This may
      occur after IDNA processing, and thus in part of the code that is
      not IDNA-aware, making detection of the undesirable combination
      impossible.

   o  Even if a label is registered under a "safe" label, there may be a
      DNAME [RFC2672] with an "unsafe" label that points to the "safe"
      label, thus creating seemingly valid names that would not satisfy
      the criterion.

   o  Wildcards create the odd situation where a label is "valid" (can
      be looked up successfully) without the zone owner knowing that
      this label exists.  So an owner of a zone whose name starts with a
      digit and contains a wildcard has no way of controlling whether or
      not names with RTL labels in them are looked up in that zone.

   Rather than trying to suggest rules that disallow all such
   undesirable situations, this document merely warns about the
   possibility, and leaves it to application developers to take whatever
   measures they deem appropriate to avoid problematic situations.

6956.  Other Issues in Need of Resolution
696
697   This document concerns itself only with the rules that are needed
698   when dealing with domain names with characters that have differing
699   Bidi properties, and considers characters only in terms of their Bidi
700   properties.  All other issues with scripts that are written from
701   right to left must be considered in other contexts.

   One such issue is the need to keep numbers separate.  Several scripts
   are used with multiple sets of numbers -- most commonly they use
   Latin numbers and a script-specific set of numbers, but in the case
   of Arabic, there are two sets of "Arabic-Indic" digits involved.

   The algorithm in this document disallows occurrences of AN-class
   characters ("Arabic-Indic digits", U+0660 to U+0669) together with
   EN-class characters (which include "European" digits, U+0030 to
   U+0039, and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
   does not help in preventing the mixing of, for instance, Bengali
   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
   both of which have Bidi class L.  A registry or script community that
   wishes to create rules restricting the mixing of digits in a label
   will be able to specify these restrictions at the registry level.
   Some rules are also specified at the protocol level.
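
   As an illustration only (not part of this specification), the Bidi
   classes named above can be inspected with Python's standard
   unicodedata module.  The helper names bidi_classes and
   mixes_an_and_en are hypothetical, and the values returned depend on
   the Unicode version bundled with the interpreter.

```python
import unicodedata


def bidi_classes(label):
    """Return the set of Unicode Bidi classes that occur in a label."""
    return {unicodedata.bidirectional(ch) for ch in label}


def mixes_an_and_en(label):
    """True if the label contains both AN-class and EN-class characters."""
    classes = bidi_classes(label)
    return "AN" in classes and "EN" in classes


arabic_indic_zero = "\u0660"   # ARABIC-INDIC DIGIT ZERO, class AN
european_zero = "0"            # DIGIT ZERO, class EN
extended_zero = "\u06F0"       # EXTENDED ARABIC-INDIC DIGIT ZERO, class EN
bengali_zero = "\u09E6"        # BENGALI DIGIT ZERO, class L
gujarati_zero = "\u0AE6"       # GUJARATI DIGIT ZERO, class L

# AN together with EN is the combination the algorithm disallows.
an_en_mix = mixes_an_and_en(arabic_indic_zero + european_zero)

# Bengali and Gujarati digits are both class L, so this check sees nothing.
bengali_gujarati_mix = mixes_an_and_en(bengali_zero + gujarati_zero)
```

   Because Bengali and Gujarati digits both report class L, a check
   based on Bidi classes alone cannot detect that mixture; any such
   restriction has to come from registry policy.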

   Another set of issues concerns the proper display of IDNs with a
   mixture of LTR and RTL labels, or only RTL labels.

   It is unrealistic to expect that applications will display domain
   names using embedded formatting codes between their labels (for one
   thing, no reliable algorithms for identifying domain names in running
   text exist); thus, the display order will be determined by the Bidi
   algorithm.  As a result, a sequence (in network order) of R1.R2.ltr
   will be displayed in the order 2R.1R.ltr in an LTR context, which
   might surprise someone expecting to see labels displayed in
   hierarchical order.  People used to working with text that mixes LTR
   and RTL strings might not be so surprised by this.  Again, this memo
   does not attempt to suggest a solution to this problem.
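
   The reordering described above can be sketched with a deliberately
   simplified model.  This is not the Unicode Bidi algorithm: it assumes
   that every label is entirely LTR or entirely RTL, that the display
   context is LTR, and that a dot between two RTL labels joins their
   run while any other dot stays in LTR order.  The function name and
   the (text, is_rtl) input format are hypothetical.

```python
def display_in_ltr_context(labels):
    """Toy model of display order in an LTR context.

    labels is a list of (text, is_rtl) pairs in network order.
    """
    pieces = []
    i = 0
    while i < len(labels):
        text, is_rtl = labels[i]
        if not is_rtl:
            pieces.append(text)
            i += 1
            continue
        # Collect the maximal run of consecutive RTL labels.
        j = i
        while j < len(labels) and labels[j][1]:
            j += 1
        run = ".".join(t for t, _ in labels[i:j])
        # The whole run, dots included, is laid out right to left.
        pieces.append(run[::-1])
        i = j
    return ".".join(pieces)


shown = display_in_ltr_context([("R1", True), ("R2", True), ("ltr", False)])
# shown == "2R.1R.ltr"
```

   The two RTL labels form a single right-to-left run, which is why R2
   ends up to the left of R1 on screen even though R1 comes first in
   network order.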

7.  Compatibility Considerations

7.1.  Backwards Compatibility Considerations

   As with any change to an existing standard, it is important to
   consider what happens with existing implementations when the change
   is introduced.  Some troublesome cases include:

   o  An old program is used to input the newly allowed label.  If the
      old program checks the input against RFC 3454, some labels will
      not be allowed, and domain names containing those labels will
      remain inaccessible.

   o  An old program is asked to display the newly allowed label, and
      checks it against RFC 3454 before displaying.  The program will
      perform some kind of fallback, most likely displaying the label in
      A-label form.

   o  An old program tries to display the newly allowed label.  If the
      old program has code for displaying the last character of a label
      that is different from the code used to display the characters in
      the middle of the label, the display may be inconsistent and cause
      confusion.

   One particular example of the last case is if a program chooses to
   examine the last character (in network order) of a string in order to
   determine its directionality, rather than its first.  If it finds an
   NSM character and tries to display the string as if it were a
   left-to-right string, the resulting display may be interesting, but
   not useful.
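
   A minimal sketch of that last case, assuming a naive routine that
   guesses directionality from a single character.  The function
   naive_direction is hypothetical, and the sample label is simply a
   Thaana letter followed by a combining vowel sign.

```python
import unicodedata


def naive_direction(ch):
    """Guess direction from one character, as a naive routine might."""
    cls = unicodedata.bidirectional(ch)
    if cls in ("R", "AL"):
        return "rtl"
    if cls == "L":
        return "ltr"
    return "undetermined"  # for example NSM, digits, punctuation


# THAANA LETTER HAA (class AL) followed by THAANA ABAFILI (class NSM)
label = "\u0780\u07A6"

from_first = naive_direction(label[0])    # looks at the AL character
from_last = naive_direction(label[-1])    # looks at the NSM character
```

   Examining the first character identifies the label as right-to-left,
   whereas the last character carries no strong direction, so a program
   that falls back to left-to-right in that case would lay out the
   label incorrectly.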

   The editors believe that these cases will have a less harmful impact
   in practice than continuing to deny the use of words from the
   languages for which these strings are necessary as IDN labels.

   This specification does not forbid using leading European digits in
   ASCII-only labels, since this would conflict with a large installed
   base of such labels, and would increase the scope of the
   specification from RTL labels to all labels.  The harm resulting from
   this limitation of scope is described in Section 5.  Registries and
   private zone managers can check for this particular condition before
   they allow registration of any RTL label.  Generally, it is best to
   disallow registration of any right-to-left strings in a zone where
   the label at the level above begins with a digit.
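
   A registry-side check of this kind might look like the following
   sketch.  It assumes that an RTL label is one containing at least one
   character of Bidi class R, AL, or AN, and that "begins with a digit"
   means an ASCII digit at the start of the zone's leftmost label.  The
   function names are hypothetical and no particular registry interface
   is implied.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label):
    """True if the label contains any character of class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def zone_starts_with_digit(zone_name):
    """True if the leftmost label of the zone begins with an ASCII digit."""
    first = zone_name.split(".")[0][:1]
    return first.isascii() and first.isdigit()


def may_register(new_label, zone_name):
    """Refuse RTL labels directly under a zone whose label starts with a digit."""
    return not (is_rtl_label(new_label) and zone_starts_with_digit(zone_name))


hebrew_label = "\u05D0\u05D1"   # two Hebrew letters, class R

allowed_under_plain_zone = may_register(hebrew_label, "example.com")
allowed_under_digit_zone = may_register(hebrew_label, "123.example")
```

   LTR labels are unaffected by this check, so existing ASCII
   registrations under digit-initial zones continue to be accepted.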

7.2.  Forward Compatibility Considerations

   This text is intentionally specified strictly in terms of the Unicode
   Bidi properties.  The determination that the condition is sufficient
   to fulfill the criteria depends on the Unicode Bidi algorithm; it is
   unlikely that drastic changes will be made to this algorithm.

   However, the determination of validity for any string depends on the
   Unicode Bidi property values, which are not declared immutable by the
   Unicode Consortium.  Furthermore, the behavior of the algorithm for
   any given character is likely to be linguistically and culturally
   sensitive, so while it should occur rarely, it is possible that later
   versions of the Unicode Standard may change the Bidi properties
   assigned to certain Unicode characters.

   This memo does not propose a solution for this problem.
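
   The dependency on Unicode property values can at least be made
   visible.  The sketch below is not a solution and not something this
   memo specifies; it only shows one way an application could record
   which Unicode version its Bidi classes were taken from, so that a
   later change in those values can be noticed.  The function name
   bidi_snapshot is hypothetical.

```python
import unicodedata


def bidi_snapshot(label):
    """Record the Bidi classes of a label with the Unicode version used."""
    return {
        "unicode_version": unicodedata.unidata_version,
        "classes": [unicodedata.bidirectional(ch) for ch in label],
    }


# A Hebrew letter (class R) followed by a Latin letter (class L).
snapshot = bidi_snapshot("\u05D0a")
```

   Comparing a stored snapshot with a fresh one after a Unicode upgrade
   would show whether any character in the label has been reclassified.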

8.  Security Considerations

   The display behavior of mixed-direction text can be extremely
   surprising to users who are not used to it; for instance, cut and
   paste of a piece of text can cause the text to display differently at
   the destination, if the destination is in another directionality
   context, and adding a character in one place of a text can cause
   characters some distance from the point of insertion to change their
   display position.  This is, however, not a phenomenon unique to the
   display of domain names.

   The new IDNA protocol, and particularly these new Bidi rules, will
   allow some strings to be used in IDNA contexts that are not allowed
   today.  It is possible that differences in the interpretation of
   labels between implementations of IDNA2003 and IDNA2008 could pose a
   security risk, but it is difficult to envision any specific
   instantiation of this.

   Any rational attempt to compute, for instance, a hash over an
   identifier processed by IDNA would use network order for its
   computation, and thus be unaffected by the new rules proposed here.

   While it is not believed to pose a problem, if display routines had
   been written with specific knowledge of the RFC 3454 IDNA
   prohibitions, it is possible that the potential problems noted under
   "Backwards Compatibility Considerations" could cause new kinds of
   confusion.

9.  Acknowledgements

   While the listed editors held the pen, this document represents the
   joint work and conclusions of an ad hoc design team.  In addition to
   the editors, the team consisted of, in alphabetical order, Tina Dam,
   Patrik Faltstrom, and John Klensin.  Many further specific
   contributions and helpful comments were received from the people
   listed below, and others who have contributed to the development and
   use of the IDNA protocols.

   The particular formulation of the Bidi rule in Section 2 was
   suggested by Matitiahu Allouche.

   The team wishes, in particular, to thank Roozbeh Pournader for
   calling its attention to the issue with the Thaana script, Paul
   Hoffman for pointing out the need to be explicit about backwards
   compatibility considerations, Ken Whistler for suggesting the basis
   of the formalized "Character Grouping" requirement, Mark Davis for
   commentary, Erik van der Poel for careful review, comments, and
   verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
   Resnick for reviews, and Vint Cerf for chairing the working group and
   contributing massively to getting the documents finished.

10.  References

10.1.  Normative References

   [RFC5890]      Klensin, J., "Internationalized Domain Names for
                  Applications (IDNA): Definitions and Document
                  Framework", RFC 5890, August 2010.

   [Unicode-UAX9] The Unicode Consortium, "Unicode Standard Annex #9:
                  Unicode Bidirectional Algorithm", September 2009,
                  <http://www.unicode.org/reports/tr9/>.

   [Unicode52]    The Unicode Consortium.  The Unicode Standard, Version
                  5.2.0, defined by: "The Unicode Standard, Version
                  5.2.0", (Mountain View, CA: The Unicode Consortium,
                  2009. ISBN 978-1-936213-00-9).
                  <http://www.unicode.org/versions/Unicode5.2.0/>.

10.2.  Informative References

   [RFC2672]      Crawford, M., "Non-Terminal DNS Name Redirection",
                  RFC 2672, August 1999.

   [RFC3454]      Hoffman, P. and M. Blanchet, "Preparation of
                  Internationalized Strings ("stringprep")", RFC 3454,
                  December 2002.

   [RFC5891]      Klensin, J., "Internationalized Domain Names in
                  Applications (IDNA): Protocol", RFC 5891, August 2010.

   [SYO]          "The Standardized Yiddish Orthography: Rules of
                  Yiddish Spelling, 6th ed., New York, ISBN
                  0-914512-25-0", 1999.

Authors' Addresses

   Harald Tveit Alvestrand (editor)
   Google
   Beddingen 10
   Trondheim,   7014
   Norway

   EMail: harald@alvestrand.no


   Cary Karp
   Swedish Museum of Natural History
   Frescativ. 40
   Stockholm,   10405
   Sweden

   Phone: +46 8 5195 4055
   Fax:
   EMail: ck@nic.museum