Internet Engineering Task Force (IETF)                H. Alvestrand, Ed.
Request for Comments: 5893                                        Google
Category: Standards Track                                        C. Karp
ISSN: 2070-1721                        Swedish Museum of Natural History
                                                             August 2010


                       Right-to-Left Scripts for
         Internationalized Domain Names for Applications (IDNA)

Abstract

   The use of right-to-left scripts in Internationalized Domain Names
   (IDNs) has presented several challenges. This memo provides a new
   Bidi rule for Internationalized Domain Names for Applications (IDNA)
   labels, based on the encountered problems with some scripts and some
   shortcomings in the 2003 IDNA Bidi criterion.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF). It represents the consensus of the IETF community. It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG). Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5893.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Alvestrand & Karp            Standards Track                    [Page 1]

RFC 5893                   IDNA Right to Left                August 2010


Table of Contents

   1. Introduction
      1.1. Purpose and Applicability
      1.2. Background and History
      1.3. Structure of the Rest of This Document
      1.4. Terminology
   2. The Bidi Rule
   3. The Requirement Set for the Bidi Rule
   4. Examples of Issues Found with RFC 3454
      4.1. Dhivehi
      4.2. Yiddish
      4.3. Strings with Numbers
   5. Troublesome Situations and Guidelines
   6. Other Issues in Need of Resolution
   7. Compatibility Considerations
      7.1. Backwards Compatibility Considerations
      7.2. Forward Compatibility Considerations
   8. Security Considerations
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References

1. Introduction

1.1. Purpose and Applicability

   The purpose of this document is to establish a rule that can be
   applied to Internationalized Domain Name (IDN) labels in Unicode form
   (U-labels) containing characters from scripts that are written from
   right to left. It is part of the revised IDNA protocol [RFC5891].

   When labels satisfy the rule, and when certain other conditions are
   satisfied, there is only a minimal chance of these labels being
   displayed in a confusing way by the Unicode bidirectional display
   algorithm.

   The other normative documents in the IDNA2008 document set establish
   criteria for valid labels, including listing the permitted
   characters. This document establishes additional validity criteria
   for labels in scripts normally written from right to left.

   This specification is not intended to place any requirements on
   domain names that do not contain characters from such scripts.

1.2. Background and History

   The "Stringprep" specification [RFC3454], part of IDNA2003, made the
   following statement in its Section 6 on the Bidi algorithm:

      3) If a string contains any RandALCat character, a RandALCat
      character MUST be the first character of the string, and a
      RandALCat character MUST be the last character of the string.

   (A RandALCat character is a character with unambiguously
   right-to-left directionality.)

   The reasoning behind this prohibition was to ensure that every
   component of a displayed domain name has an unambiguously preferred
   direction. However, this made certain words in languages written
   with right-to-left scripts invalid as IDN labels, and in at least one
   case (Dhivehi) meant that all the words of an entire language were
   forbidden as IDN labels.

   This is illustrated below with examples taken from the Dhivehi and
   Yiddish languages, as written with the Thaana and Hebrew scripts,
   respectively.

   RFC 3454 did not explicitly state the requirement to be fulfilled.
   Therefore, it is impossible to determine whether a simple relaxation
   of the rule would continue to fulfill the requirement.

   While this document specifies rules quite different from RFC 3454,
   most reasonable labels that were allowed under RFC 3454 will also be
   allowed under this specification (the most important example of
   non-permitted labels being labels that mix Arabic and European digits
   (AN and EN) inside an RTL label, and labels that use AN in an LTR
   label -- see Section 1.4 for terminology), so the operational impact
   of using the new rule in the updated IDNA specification is limited.

1.3. Structure of the Rest of This Document

   Section 2 defines a rule, the "Bidi rule", which can be used on a
   domain name label to check how safe it is to use in a domain name of
   possibly mixed directionality. The primary initial use of this rule
   is as part of the IDNA2008 protocol [RFC5891].

   Section 3 sets out the requirements for defining the Bidi rule.

   Section 4 gives detailed examples that serve as justification for the
   new rule.

   Section 5 to Section 8 describe various situations that can occur
   when dealing with domain names with characters of different
   directionality.

   Only Section 1.4 and Section 2 are normative.

1.4. Terminology

   The terminology used to describe IDNA concepts is defined in the
   Definitions document [RFC5890].

   The terminology used for the Bidi properties of Unicode characters is
   taken from the Unicode Standard [Unicode52].

   The Unicode Standard specifies a Bidi property for each character.
   That property controls the character's behavior in the Unicode
   bidirectional algorithm [Unicode-UAX9].
   For reference, here are the
   values that the Unicode Bidi property can have:

   o  L - Left to right - most letters in LTR scripts

   o  R - Right to left - most letters in non-Arabic RTL scripts

   o  AL - Arabic letters - most letters in the Arabic script

   o  EN - European Number (0-9, and Extended Arabic-Indic numbers)

   o  ES - European Number Separator (+ and -)

   o  ET - European Number Terminator (currency symbols, the hash sign,
      the percent sign and so on)

   o  AN - Arabic Number; this encompasses the Arabic-Indic numbers, but
      not the Extended Arabic-Indic numbers

   o  CS - Common Number Separator (. , / : et al)

   o  NSM - Nonspacing Mark - most combining accents

   o  BN - Boundary Neutral - control characters (ZWNJ, ZWJ, and others)

   o  B - Paragraph Separator

   o  S - Segment Separator

   o  WS - Whitespace, including the SPACE character

   o  ON - Other Neutrals, including @, &, parentheses, MIDDLE DOT

   o  LRE, LRO, RLE, RLO, PDF - these are "directional control
      characters" and are not used in IDNA labels.

   In this memo, we use "network order" to describe the sequence of
   characters as transmitted on the wire or stored in a file; the terms
   "first", "next", "previous", "beginning", "end", "before", and
   "after" are used to refer to the relationship of characters and
   labels in network order.

   We use "display order" to talk about the sequence of characters as
   imaged on a display medium; the terms "left" and "right" are used to
   refer to the relationship of characters and labels in display order.
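
   As a non-normative illustration of the Bidi property values listed
   above, the following Python sketch looks up the Bidi class of each
   character of a string in network order. It assumes the
   standard-library unicodedata module, whose tables follow the Unicode
   version bundled with the interpreter rather than Unicode 5.2, so
   individual values could differ from [Unicode52]. The helper name
   bidi_classes is hypothetical.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode Bidi class of each character, in network order."""
    return [unicodedata.bidirectional(ch) for ch in text]


# ASCII letter, European digit, HYPHEN-MINUS, FULL STOP
print(bidi_classes("a1-."))            # ['L', 'EN', 'ES', 'CS']

# HEBREW LETTER ALEF followed by HEBREW POINT QAMATS
print(bidi_classes("\u05d0\u05b8"))    # ['R', 'NSM']

# ARABIC-INDIC DIGIT ZERO versus EXTENDED ARABIC-INDIC DIGIT ZERO
print(bidi_classes("\u0660\u06f0"))    # ['AN', 'EN']
```

   The last line shows the distinction drawn in the list above: the
   Arabic-Indic digits are AN, while the Extended Arabic-Indic digits
   share class EN with the ASCII digits.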

   Most of the time, the examples use the abbreviations for the Unicode
   Bidi classes to denote the directionality of the characters; the
   example string CS L consists of one character of class CS and one
   character of class L. In some examples, the convention that
   uppercase characters are of class R or AL, and lowercase characters
   are of class L is used -- thus, the example string ABC.abc would
   consist of three right-to-left characters and three left-to-right
   characters.

   The directionality of such examples is determined by context -- for
   instance, in the sentence "ABC.abc is displayed as CBA.abc", the
   first example string is in network order, the second example string
   is in display order.

   The term "paragraph" is used in the sense of the Unicode Bidi
   specification [Unicode-UAX9]. It means "a block of text that has an
   overall direction, either left to right or right to left",
   approximately; see the "Unicode Bidirectional Algorithm"
   [Unicode-UAX9] for details.

   "RTL" and "LTR" are abbreviations for "right to left" and "left to
   right", respectively.

   An RTL label is a label that contains at least one character of type
   R, AL, or AN.

   An LTR label is any label that is not an RTL label.

   A "Bidi domain name" is a domain name that contains at least one RTL
   label. (Note: This definition includes domain names containing only
   dots and right-to-left characters. Providing a separate category of
   "RTL domain names" would not make this specification simpler, so it
   has not been done.)

2. The Bidi Rule

   The following rule, consisting of six conditions, applies to labels
   in Bidi domain names. The requirements that this rule satisfies are
   described in Section 3.
   All of the conditions must be satisfied for
   the rule to be satisfied.

   1. The first character must be a character with Bidi property L, R,
      or AL. If it has the R or AL property, it is an RTL label; if it
      has the L property, it is an LTR label.

   2. In an RTL label, only characters with the Bidi properties R, AL,
      AN, EN, ES, CS, ET, ON, BN, or NSM are allowed.

   3. In an RTL label, the end of the label must be a character with
      Bidi property R, AL, EN, or AN, followed by zero or more
      characters with Bidi property NSM.

   4. In an RTL label, if an EN is present, no AN may be present, and
      vice versa.

   5. In an LTR label, only characters with the Bidi properties L, EN,
      ES, CS, ET, ON, BN, or NSM are allowed.

   6. In an LTR label, the end of the label must be a character with
      Bidi property L or EN, followed by zero or more characters with
      Bidi property NSM.

   The following guarantees can be made based on the above:

   o  In a domain name consisting of only labels that satisfy the rule,
      the requirements of Section 3 are satisfied. Note that even LTR
      labels and pure ASCII labels have to be tested.

   o  In a domain name consisting of only LDH labels (as defined in the
      Definitions document [RFC5890]) and labels that satisfy the rule,
      the requirements of Section 3 are satisfied as long as a label
      that starts with an ASCII digit does not come after a
      right-to-left label.

   No guarantee is given for other combinations.

3. The Requirement Set for the Bidi Rule

   This document, unlike RFC 3454 [RFC3454], provides an explicit
   justification for the Bidi rule, and states a set of requirements for
   which it is possible to test whether or not the modified rule
   fulfills the requirement.
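
   As a non-normative aid, the six conditions of the Bidi rule in
   Section 2 could be checked for a single label roughly as follows.
   This is a sketch in Python, not part of the specification. The
   function name satisfies_bidi_rule is hypothetical, property values
   come from the interpreter's unicodedata tables (which may be newer
   than [Unicode52]), rejecting an empty label is an assumption, and
   the other IDNA2008 validity checks are out of scope.

```python
import unicodedata

# Bidi classes permitted by conditions 2 and 5 of Section 2.
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def satisfies_bidi_rule(label):
    """Return True if `label` (a U-label string) meets all six conditions."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False  # assumption: an empty label is rejected

    # Condition 1: the first character is L, R, or AL and fixes the
    # direction of the label.
    first = classes[0]
    if first in ("R", "AL"):
        rtl = True
    elif first == "L":
        rtl = False
    else:
        return False

    # Conditions 3 and 6 look at the last character that is not a
    # trailing NSM. The first character is never NSM here, so the
    # loop always stops with end >= 1.
    end = len(classes)
    while classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]

    if rtl:
        if not set(classes) <= RTL_ALLOWED:        # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):    # condition 3
            return False
        if "EN" in classes and "AN" in classes:    # condition 4
            return False
    else:
        if not set(classes) <= LTR_ALLOWED:        # condition 5
            return False
        if last not in ("L", "EN"):                # condition 6
            return False
    return True


print(satisfies_bidi_rule("abc"))                 # True
print(satisfies_bidi_rule("\u05d0\u05b8"))        # True  (R NSM)
print(satisfies_bidi_rule("\u0627\u0660\u06f0"))  # False (AN mixed with EN)
print(satisfies_bidi_rule("5\u05d0"))             # False (starts with EN)
```

   Skipping trailing NSM characters before testing the final character
   is what lets labels ending in a combining mark pass conditions 3
   and 6.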

   All the text in this document assumes that text containing the labels
   under consideration will be displayed using the Unicode bidirectional
   algorithm [Unicode-UAX9].

   The requirements proposed are these:

   o  Label Uniqueness: No two labels, when presented in display order
      in the same paragraph, should have the same sequence of characters
      without also having the same sequence of characters in network
      order, both when the paragraph has LTR direction and when the
      paragraph has RTL direction. (This is the criterion that is
      explicit in RFC 3454). (Note that a label displayed in an RTL
      paragraph may display the same as a different label displayed in
      an LTR paragraph and still satisfy this criterion.)

   o  Character Grouping: When displaying a string of labels, using the
      Unicode Bidi algorithm to reorder the characters for display, the
      characters of each label should remain grouped between the
      characters delimiting the labels, both when the string is embedded
      in a paragraph with LTR direction and when it is embedded in a
      paragraph with RTL direction.

   Several stronger statements were considered and rejected, because
   they seem to be impossible to fulfill within the constraints of the
   Unicode bidirectional algorithm. These include:

   o  The appearance of a label should be unaffected by its embedding
      context. This proved impossible even for ASCII labels; the label
      "123-A" will have a different display order in an RTL context than
      in an LTR context. (This particular example is, however,
      disallowed anyway.)

   o  The sequence of labels should be consistent with network order.
      This proved impossible -- a domain name consisting of the labels
      (in network order) L1.R2.R3.L4 will be displayed as L1.R3.R2.L4 in
      an LTR context.
      (In an RTL context, it will be displayed as
      L4.R3.R2.L1).

   o  No two domain names should be displayed the same, even under
      differing directionality. This was shown to be unsound, since the
      domain name (in network order) ABC.abc will have display order
      CBA.abc in an LTR context and abc.CBA in an RTL context, while the
      domain name (network) abc.ABC will have display order abc.CBA in
      an LTR context and CBA.abc in an RTL context.

   One possible requirement was thought to be problematic, but turned
   out to be satisfied by a string that obeys the proposed rules:

   o  The Character Grouping requirement should be satisfied when
      directional controls (LRE, RLE, RLO, LRO, PDF) are used in the
      same paragraph (outside of the labels). Because these controls
      affect presentation order in non-obvious ways, by affecting the
      "sor" and "eor" properties of the Unicode Bidi algorithm, the
      conditions above require extra testing in order to figure out
      whether or not they influence the display of the domain name.
      Testing found that for the strings allowed under the rule
      presented in this document, directional controls do not influence
      the display of the domain name.

   This is still not stated as a requirement, since it did not seem as
   important as the stated requirements, but it is useful to know that
   Bidi domain names where the labels satisfy the rule have this
   property.

   In the following descriptions, first-level bullets are used to
   indicate rules or normative statements; second-level bullets are
   commentary.

   The Character Grouping requirement can be more formally stated as:

   o  Let "Delimiterchars" be a set of characters with the Unicode Bidi
      properties CS, WS, ON.
      (These are commonly used to delimit labels
      -- both the FULL STOP and the space are included. They are not
      allowed in domain labels.)

      *  ET, though it commonly occurs next to domain names in practice,
         is problematic: the context R CS L EN ET (for instance A.a1%)
         makes the label L EN not satisfy the character grouping
         requirement.

      *  ES commonly occurs in labels as HYPHEN-MINUS, but could also be
         used as a delimiter (for instance, the plus sign). It is left
         out here.

   o  Let "unproblematic label" be a label that either satisfies the
      requirements or does not contain any character with the Bidi
      properties R, AL, or AN and does not begin with a character with
      the Bidi property EN. (Informally, "it does not start with a
      number".)

   A label X satisfies the Character Grouping requirement when, for any
   Delimiter Character D1 and D2, and for any label S1 and S2 that is an
   unproblematic label or an empty string, the following holds true:

      If the string formed by concatenating S1, D1, X, D2, and S2 is
      reordered according to the Bidi algorithm, then all the characters
      of X in the reordered string are between D1 and D2, and no other
      characters are between D1 and D2, both if the overall paragraph
      direction is LTR and if the overall paragraph direction is RTL.

   Note that the definition is self-referential, since S1 and S2 are
   constrained to be "legal" by this definition. This makes testing
   changes to proposed rules a little complex, but does not create
   problems for testing whether or not a given proposed rule satisfies
   the criterion.

   The "zero-length" case represents the case where a domain name is
   next to something that isn't a domain name, separated by a delimiter
   character.

   Note about the position of BN: The Unicode bidirectional algorithm
   specifies that a BN has an effect on the adjoining characters in
   network order, not in display order, and that BNs are therefore
   treated as if removed during Bidi processing ([Unicode-UAX9],
   Section 3.3.2, rule X9 and Section 5.3). Therefore, the question of
   "what position does a BN have after reordering" is not meaningful.
   It has been ignored while developing the rules here.

   The Label Uniqueness requirement can be formally stated as:

      If two non-identical labels X and Y, embedded as for the test
      above, displayed in paragraphs with the same directionality, are
      reordered by the Bidi algorithm into the same sequence of code
      points, the labels X and Y cannot both be legal.

4. Examples of Issues Found with RFC 3454

4.1. Dhivehi

   Dhivehi, the official language of the Maldives, is written with the
   Thaana script. This script displays some of the characteristics of
   the Arabic script, including its directional properties, and the
   indication of vowels by the diacritical marking of consonantal base
   characters. This marking is obligatory, and both two consecutive
   vowels and syllable-final consonants are indicated with unvoiced
   combining marks. Every Dhivehi word therefore ends with a combining
   mark.

   The word for "computer", which is romanized as "konpeetaru", is
   written with the following sequence of Unicode code points:

      U+0786 THAANA LETTER KAAFU (AL)

      U+07AE THAANA OBOFILI (NSM)

      U+0782 THAANA LETTER NOONU (AL)

      U+07B0 THAANA SUKUN (NSM)

      U+0795 THAANA LETTER PAVIYANI (AL)

      U+07A9 THAANA EEBEEFILI (NSM)

      U+0793 THAANA LETTER TAVIYANI (AL)

      U+07A6 THAANA ABAFILI (NSM)

      U+0783 THAANA LETTER RAA (AL)

      U+07AA THAANA UBUFILI (NSM)

   The directionality class of U+07AA in the Unicode database
   [Unicode52] is NSM (Nonspacing Mark), which is not R or AL; a
   conformant implementation of the IDNA2003 algorithm will say that
   "this is not in RandALCat" and refuse to encode the string.

4.2. Yiddish

   Yiddish is one of several languages written with the Hebrew script
   (others include Hebrew and Ladino). This is basically a consonantal
   alphabet (also termed an "abjad"), but Yiddish is written using an
   extended form that is fully vocalic. The vowels are indicated in
   several ways, one of which is by repurposing letters that are
   consonants in Hebrew. Other letters are used both as vowels and
   consonants, with combining marks, called "points", used to
   differentiate between them. Finally, some base characters can
   indicate several different vowels, which are also disambiguated by
   combining marks. Pointed characters can appear in word-final
   position and may therefore also be needed at the end of labels. This
   is not an invariable attribute of a Yiddish string and there is thus
   greater latitude here than there is with Dhivehi.
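
   The rejection described in Section 4.1, and the same effect for a
   Yiddish label ending in a pointed character, can be reproduced with
   a sketch of the RFC 3454 condition quoted in Section 1.2. The
   following Python is illustrative only. It approximates RandALCat as
   "Bidi class R or AL" using the standard-library unicodedata module,
   whereas RFC 3454 uses its own tables derived from Unicode 3.2. The
   names rfc3454_rule3_ok, KONPEETARU, and ALEF_QAMATS are
   hypothetical.

```python
import unicodedata

RANDAL = ("R", "AL")  # approximation of RandALCat, see the note above

# "konpeetaru" as listed in Section 4.1; it ends in U+07AA (NSM).
KONPEETARU = "\u0786\u07ae\u0782\u07b0\u0795\u07a9\u0793\u07a6\u0783\u07aa"

# A Hebrew-script string ending in a point: ALEF followed by QAMATS.
ALEF_QAMATS = "\u05d0\u05b8"


def rfc3454_rule3_ok(s):
    """Old rule: if any RandALCat character is present, both the first
    and the last character of the string must be RandALCat."""
    classes = [unicodedata.bidirectional(ch) for ch in s]
    if not any(c in RANDAL for c in classes):
        return True  # the condition only applies to strings with RandALCat
    return classes[0] in RANDAL and classes[-1] in RANDAL


print(rfc3454_rule3_ok(KONPEETARU))    # False: last character is NSM
print(rfc3454_rule3_ok(ALEF_QAMATS))   # False: last character is NSM
print(rfc3454_rule3_ok("\u05d0\u05d1"))  # True: ALEF BET, both class R
```

   Both rejected strings end in a combining mark, which is the case
   that condition 3 of the Bidi rule in Section 2 admits by allowing
   trailing NSM characters.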

   The organization now known as the "YIVO Institute for Jewish
   Research" developed orthographic rules for modern Standard Yiddish
   during the 1930s on the basis of work conducted in several venues
   since earlier in that century. These are given in, "The Standardized
   Yiddish Orthography: Rules of Yiddish Spelling" [SYO], and are taken
   as normatively descriptive of modern Standard Yiddish in any context
   where that notion is deemed relevant. They have been applied
   exclusively in all formal Yiddish dictionaries published since their
   establishment, and are similarly dominant in academic and
   bibliographic regards.

   It therefore appears appropriate for this repertoire also to be
   supported fully by IDNA. This presents no difficulty with characters
   in initial and medial positions, but pointed characters are regularly
   used in final position as well. All of the characters in the SYO
   repertoire appear in both marked and unmarked form with one
   exception: the HEBREW LETTER PE (U+05E4). The SYO only permits this
   with a HEBREW POINT DAGESH (U+05BC), providing the Yiddish equivalent
   to the Latin letter "p", or a HEBREW POINT RAFE (U+05BF), equivalent
   to the Latin letter "f". There is, however, a separate unpointed
   allograph, the HEBREW LETTER FINAL PE (U+05E3), for the latter
   character when it appears in final position. The constraint on the
   use of the SYO repertoire resulting from the proscription of
   combining marks at the end of RTL strings thus reduces to nothing
   more, or less, than the equivalent of saying that a string of Latin
   characters cannot end with the letter "p".
   It must also be noted
   that the HEBREW LETTER PE with the HEBREW POINT DAGESH is
   characteristic of almost all traditional Yiddish orthographies that
   predate (or remain in use in parallel to) the SYO, being the first
   pointed character to appear in any of them.

   A more general instantiation of the basic problem can be seen in the
   representation of the YIVO acronym. This acronym is written with the
   Hebrew letters YOD YOD HIRIQ VAV VAV ALEF QAMATS, where HIRIQ and
   QAMATS are combining points. The Unicode code points are:

      U+05D9 HEBREW LETTER YOD (R)

      U+05B4 HEBREW POINT HIRIQ (NSM)

      U+05D5 HEBREW LETTER VAV (R)

      U+05D0 HEBREW LETTER ALEF (R)

      U+05B8 HEBREW POINT QAMATS (NSM)

   The directionality class of U+05B8 HEBREW POINT QAMATS in the Unicode
   database is NSM, which again causes the IDNA2003 algorithm to reject
   the string.

   It may also be noted that all of the combined characters mentioned
   above exist in precomposed form at separate positions in the Unicode
   chart. However, by invoking Stringprep, the IDNA2003 algorithm also
   rejects those code points, for reasons not discussed here.

4.3. Strings with Numbers

   By requiring that the first and last characters of a string be
   members of category R or AL, the Stringprep specification [RFC3454]
   prohibited a string containing right-to-left characters from ending
   with a number.

   Consider the strings ALEF 5 (HEBREW LETTER ALEF + DIGIT FIVE) and 5
   ALEF. Displayed in an LTR context, the first one will be displayed
   from left to right as 5 ALEF (with the 5 being considered right to
   left because of the leading ALEF), while 5 ALEF will be displayed in
   exactly the same order (5 taking the direction from context).
   Clearly, only one of those should be permitted as a registered label,
   but barring them both seems unnecessary.

5. Troublesome Situations and Guidelines

   There are situations in which labels that satisfy the rule above will
   be displayed in a surprising fashion. The most important of these is
   the case where a label ending in a character with Bidi property AL,
   AN, or R occurs before a label beginning with a character of Bidi
   property EN. In that case, the number will appear to move into the
   label containing the right-to-left character, violating the Character
   Grouping requirement.

   If the label that occurs after the right-to-left label itself
   satisfies the Bidi criterion, the requirements will be satisfied in
   all cases (this is the reason why the criterion talks about strings
   containing L in some cases). However, the IDNABIS WG concluded that
   this could not be required for several reasons:

   o  There is a large current deployment of ASCII domain names starting
      with digits. These cannot possibly be invalidated.

   o  Domain names are often constructed piecemeal, for instance, by
      combining a string with the content of a search list. This may
      occur after IDNA processing, and thus in part of the code that is
      not IDNA-aware, making detection of the undesirable combination
      impossible.

   o  Even if a label is registered under a "safe" label, there may be a
      DNAME [RFC2672] with an "unsafe" label that points to the "safe"
      label, thus creating seemingly valid names that would not satisfy
      the criterion.

   o  Wildcards create the odd situation where a label is "valid" (can
      be looked up successfully) without the zone owner knowing that
      this label exists.
      So an owner of a zone whose name starts with a
      digit and contains a wildcard has no way of controlling whether or
      not names with RTL labels in them are looked up in his zone.

   Rather than trying to suggest rules that disallow all such
   undesirable situations, this document merely warns about the
   possibility, and leaves it to application developers to take whatever
   measures they deem appropriate to avoid problematic situations.

6. Other Issues in Need of Resolution

   This document concerns itself only with the rules that are needed
   when dealing with domain names with characters that have differing
   Bidi properties, and considers characters only in terms of their Bidi
   properties. All other issues with scripts that are written from
   right to left must be considered in other contexts.

   One such issue is the need to keep numbers separate. Several scripts
   are used with multiple sets of numbers -- most commonly they use
   Latin numbers and a script-specific set of numbers, but in the case
   of Arabic, there are two sets of "Arabic-Indic" digits involved.

   The algorithm in this document disallows occurrences of AN-class
   characters ("Arabic-Indic digits", U+0660 to U+0669) together with
   EN-class characters (which includes "European" digits, U+0030 to
   U+0039 and "extended Arabic-Indic digits", U+06F0 to U+06F9), but
   does not help in preventing the mixing of, for instance, Bengali
   digits (U+09E6 to U+09EF) and Gujarati digits (U+0AE6 to U+0AEF),
   both of which have Bidi class L. A registry or script community that
   wishes to create rules restricting the mixing of digits in a label
   will be able to specify these restrictions at the registry level.
   Some rules are also specified at the protocol level.

   Another set of issues concerns the proper display of IDNs with a
   mixture of LTR and RTL labels, or only RTL labels.

   It is unrealistic to expect that applications will display domain
   names using embedded formatting codes between their labels (for one
   thing, no reliable algorithms for identifying domain names in running
   text exist); thus, the display order will be determined by the Bidi
   algorithm. Thus, a sequence (in network order) of R1.R2.ltr will be
   displayed in the order 2R.1R.ltr in an LTR context, which might
   surprise someone expecting to see labels displayed in hierarchical
   order. People used to working with text that mixes LTR and RTL
   strings might not be so surprised by this. Again, this memo does not
   attempt to suggest a solution to this problem.

7. Compatibility Considerations

7.1. Backwards Compatibility Considerations

   As with any change to an existing standard, it is important to
   consider what happens with existing implementations when the change
   is introduced. Some troublesome cases include:

   o  An old program used to input the newly allowed label. If the old
      program checks the input against RFC 3454, some labels will not be
      allowed, and domain names containing those labels will remain
      inaccessible.

   o  An old program is asked to display the newly allowed label, and
      checks it against RFC 3454 before displaying. The program will
      perform some kind of fallback, most likely displaying the label in
      A-label form.

   o  An old program tries to display the newly allowed label. If the
      old program has code for displaying the last character of a label
      that is different from the code used to display the characters in
      the middle of the label, the display may be inconsistent and cause
      confusion.

   One particular example of the last case is if a program chooses to
   examine the last character (in network order) of a string in order to
   determine its directionality, rather than its first. If it finds an
   NSM character and tries to display the string as if it was a
   left-to-right string, the resulting display may be interesting, but
   not useful.

   The editors believe that these cases will have a less harmful impact
   in practice than continuing to deny the use of words from the
   languages for which these strings are necessary as IDN labels.

   This specification does not forbid using leading European digits in
   ASCII-only labels, since this would conflict with a large installed
   base of such labels, and would increase the scope of the
   specification from RTL labels to all labels. The harm resulting from
   this limitation of scope is described in Section 5. Registries and
   private zone managers can check for this particular condition before
   they allow registration of any RTL label. Generally, it is best to
   disallow registration of any right-to-left strings in a zone where
   the label at the level above begins with a digit.

7.2. Forward Compatibility Considerations

   This text is intentionally specified strictly in terms of the Unicode
   Bidi properties. The determination that the condition is sufficient
   to fulfill the criteria depends on the Unicode Bidi algorithm; it is
   unlikely that drastic changes will be made to this algorithm.

   However, the determination of validity for any string depends on the
   Unicode Bidi property values, which are not declared immutable by the
   Unicode Consortium.
   Furthermore, the behavior of the algorithm for any given character
   is likely to be linguistically and culturally sensitive, so while it
   should occur rarely, it is possible that later versions of the
   Unicode Standard may change the Bidi properties assigned to certain
   Unicode characters.

   This memo does not propose a solution for this problem.

8.  Security Considerations

   The display behavior of mixed-direction text can be extremely
   surprising to users who are not used to it; for instance, cut and
   paste of a piece of text can cause the text to display differently at
   the destination, if the destination is in another directionality
   context, and adding a character in one place of a text can cause
   characters some distance from the point of insertion to change their
   display position.  This is, however, not a phenomenon unique to the
   display of domain names.

   The new IDNA protocol, and particularly these new Bidi rules, will
   allow some strings to be used in IDNA contexts that are not allowed
   today.  It is possible that differences in the interpretation of
   labels between implementations of IDNA2003 and IDNA2008 could pose a
   security risk, but it is difficult to envision any specific
   instantiation of this.

   Any rational attempt to compute, for instance, a hash over an
   identifier processed by IDNA would use network order for its
   computation, and thus be unaffected by the new rules proposed here.

   While it is not believed to pose a problem, if display routines had
   been written with specific knowledge of the RFC 3454 IDNA
   prohibitions, it is possible that the potential problems noted under
   "Backwards Compatibility Considerations" could cause new kinds of
   confusion.

9.  Acknowledgements

   While the listed editors held the pen, this document represents the
   joint work and conclusions of an ad hoc design team.  In addition to
   the editors, this consisted of, in alphabetic order, Tina Dam, Patrik
   Faltstrom, and John Klensin.  Many further specific contributions and
   helpful comments were received from the people listed below, and
   others who have contributed to the development and use of the IDNA
   protocols.

   The particular formulation of the Bidi rule in Section 2 was
   suggested by Matitiahu Allouche.

   The team wishes, in particular, to thank Roozbeh Pournader for
   calling its attention to the issue with the Thaana script, Paul
   Hoffman for pointing out the need to be explicit about backwards
   compatibility considerations, Ken Whistler for suggesting the basis
   of the formalized "Character Grouping" requirement, Mark Davis for
   commentary, Erik van der Poel for careful review, comments, and
   verification of the rulesets, Marcos Sanz, Andrew Sullivan, and Pete
   Resnick for reviews, and Vint Cerf for chairing the working group and
   contributing massively to getting the documents finished.

10.  References

10.1.  Normative References

   [RFC5890]       Klensin, J., "Internationalized Domain Names for
                   Applications (IDNA): Definitions and Document
                   Framework", RFC 5890, August 2010.

   [Unicode-UAX9]  The Unicode Consortium, "Unicode Standard Annex #9:
                   Unicode Bidirectional Algorithm", September 2009,
                   <http://www.unicode.org/reports/tr9/>.

   [Unicode52]     The Unicode Consortium.  The Unicode Standard,
                   Version 5.2.0, defined by: "The Unicode Standard,
                   Version 5.2.0", (Mountain View, CA: The Unicode
                   Consortium, 2009.  ISBN 978-1-936213-00-9).
                   <http://www.unicode.org/versions/Unicode5.2.0/>.

10.2.  Informative References

   [RFC2672]       Crawford, M., "Non-Terminal DNS Name Redirection",
                   RFC 2672, August 1999.

   [RFC3454]       Hoffman, P. and M. Blanchet, "Preparation of
                   Internationalized Strings ("stringprep")", RFC 3454,
                   December 2002.

   [RFC5891]       Klensin, J., "Internationalized Domain Names in
                   Applications (IDNA): Protocol", RFC 5891, August 2010.

   [SYO]           "The Standardized Yiddish Orthography: Rules of
                   Yiddish Spelling, 6th ed., New York, ISBN
                   0-914512-25-0", 1999.

Authors' Addresses

   Harald Tveit Alvestrand (editor)
   Google
   Beddingen 10
   Trondheim,   7014
   Norway

   EMail: harald@alvestrand.no


   Cary Karp
   Swedish Museum of Natural History
   Frescativ. 40
   Stockholm,   10405
   Sweden

   Phone: +46 8 5195 4055
   Fax:
   EMail: ck@nic.museum