JDK-8269406

3.3: Clarify the effect of Unicode escape processing

    Details

    • Type: Enhancement
    • Status: Resolved
    • Priority: P4
    • Resolution: Fixed
    • Affects Version/s: 16
    • Fix Version/s: 17
    • Component/s: specification
    • Labels:

      Description

      JLS 3.3 has always been clear that \uABCD in the input stream is translated to a single 16-bit code unit, but has not been sufficiently clear about whether/how the result of that translation appears in the input stream for further lexical translation. (Short answer: It does, as a RawInputCharacter.)

      1. Per https://mail.openjdk.java.net/pipermail/compiler-dev/2021-June/017337.html the following text:
      -----
      A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value, and passing all other characters unchanged. ... This translation step results in a sequence of Unicode input characters.
      -----
      should be sharpened w.r.t. the result of "translating":
      -----
      A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to *a _raw input character_ which denotes* the UTF-16 code unit (§3.1) for the indicated hexadecimal value. All other characters are passed unchanged *as raw input characters*. ... This translation step results in a sequence of Unicode input characters, *all of which are raw input characters*.
      -----

      The modified text makes clear that the result of processing a Unicode escape is just a raw input character in the input stream. If the Unicode escape is \u005c, then the resulting raw input character is \ -- such a raw input character cannot serve as the \ in a \uABCD sequence, but can partake in the subsequent input processing which counts contiguous \ raw input characters.

      As a technical matter of specification, the modified text rejects the idea that the output of 3.3 -- a sequence of UnicodeInputCharacter lexemes -- can contain a mix of RawInputCharacter lexemes and UnicodeEscape lexemes. The UnicodeEscape production is necessary for specifying the syntax of a Unicode escape in the input stream, but none of the downstream sections that take UnicodeInputCharacter (e.g., 3.4's InputCharacter production) want to see a UnicodeEscape lexeme; they want to see the RawInputCharacter lexeme that results from translating six ASCII characters \ u A B C D to a single raw input character.

      2. Per https://mail.openjdk.java.net/pipermail/compiler-dev/2021-July/017585.html the following edit should be made to fully and faithfully describe the behavior of Java 15 compilers (both `javac` and `ecj`):

      -----
      ~In addition to the processing implied by the grammar,~
      +The UnicodeInputCharacter production is ambiguous because an ASCII \ character in the input stream could be reduced to either a RawInputCharacter or to the \ of a UnicodeEscape (to be followed by an ASCII u). To avoid ambiguity,+
      [All new text follows]
      for each ASCII \ character in the input stream, input processing must consider the most recent raw input characters that resulted from this translation step:

      - If the most recent raw input character was itself translated from a Unicode escape in the input stream, then the ASCII \ character is eligible to begin a Unicode escape. (For example, if the most recent raw input character in the result was a backslash that arose from a Unicode escape \u005c in the input stream, then an ASCII \ character in the input stream is eligible to begin another Unicode escape.)

      - Otherwise, consider how many backslashes appeared contiguously as raw input characters in the result, back to a non-backslash character or the start of the result. (It is immaterial whether any such backslash arose from an ASCII \ character in the input stream or from a Unicode escape \u005c in the input stream.) If this number is even, then the ASCII \ character is eligible to begin a Unicode escape; if the number is odd, then the ASCII \ character is not eligible to begin a Unicode escape.
      -----
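
      The eligibility rule above can be sketched in Java as follows. This is only an illustration of the proposed text, not the implementation used by `javac` or `ecj`; the class name UnicodeEscapeSketch, the method translate, and the choice to throw IllegalArgumentException for a malformed escape are assumptions made for the example.

```java
// Hypothetical sketch of the JLS 3.3 translation step as described in this issue.
final class UnicodeEscapeSketch {
    private UnicodeEscapeSketch() {}

    /** Counts the backslashes that contiguously end the result so far. */
    private static int trailingBackslashes(CharSequence result) {
        int count = 0;
        for (int k = result.length() - 1; k >= 0 && result.charAt(k) == '\\'; k--) {
            count++;
        }
        return count;
    }

    /** Returns the value of an ASCII hex digit, or -1 for any other character. */
    private static int hexValue(char ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    /** Translates an input stream into a sequence of raw input characters. */
    static String translate(String input) {
        StringBuilder result = new StringBuilder();
        // True when the most recent raw input character came from a Unicode escape.
        boolean lastFromEscape = false;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // First bullet: previous character came from an escape.
            // Second bullet: an even number of contiguous backslashes precedes this one.
            boolean eligible = c == '\\'
                    && (lastFromEscape || trailingBackslashes(result) % 2 == 0);
            boolean followedByU = i + 1 < input.length() && input.charAt(i + 1) == 'u';
            if (!(eligible && followedByU)) {
                result.append(c);          // passed unchanged as a raw input character
                lastFromEscape = false;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < input.length() && input.charAt(j) == 'u') {
                j++;                       // one or more u characters may follow the backslash
            }
            if (j + 4 > input.length()) {
                throw new IllegalArgumentException("truncated Unicode escape at index " + i);
            }
            int codeUnit = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(input.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                }
                codeUnit = codeUnit * 16 + digit;
            }
            result.append((char) codeUnit); // a single raw input character
            lastFromEscape = true;
            i = j + 4;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Each Java literal below doubles its backslashes to spell the input stream.
        System.out.println(translate("\\u005cu005a"));
        System.out.println(translate("\\\\\\u006e"));
    }
}
```

      Under these assumptions, the input stream \u005cu005a yields the six characters \ u 0 0 5 a, and the input stream \\\u006e yields the three characters \ \ n.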

      3. The note after "The character produced by a Unicode escape does not participate in further Unicode escapes." should be expanded:

      -----
      For example, the input stream \u005cu005a results in the six characters \ u 0 0 5 a, because 005c is the Unicode value for \. It does not result in the character Z, which is Unicode character 005a, because the \ that resulted from processing \u005c is not interpreted as the start of a further Unicode escape.

      Note that \u005cu005a cannot be written in a string literal to denote the six characters \ u 0 0 5 a. This is because the first two characters resulting from translation, \ and u, are interpreted in a string literal as an illegal escape sequence (3.10.7).

      Fortunately, the rule about contiguous \ characters helps programmers to craft input streams that denote Unicode escapes in a string literal. Denoting the six characters \ u 0 0 5 a with a string literal simply requires another \ to be placed adjacent to the existing \, for example, "Z is \\u005a". This works because the second \ in the input stream \\u005a is not eligible to begin a Unicode escape, so the first \ and the second \ are preserved as raw input characters, as are the next five characters u 0 0 5 a; the two \ characters are subsequently interpreted in a string literal as the escape sequence for a backslash, resulting in a string with the desired six characters \ u 0 0 5 a. Without the rule, the input stream \\u005a would be translated as the raw input character \ followed by the Unicode escape \u005a (Z), but this translation would be unhelpful because \Z is an illegal escape sequence in a string literal. (Note that the rule translates \u005c\u005c to \\ because the translation of the first Unicode escape to the raw input character \ does not prevent the translation of the second Unicode escape to another raw input character \.)

      The rule also allows programmers to craft input streams that denote escape sequences in a string literal. For example, the input stream \\\u006e results in the three characters \ \ n because the third \ is eligible to begin a Unicode escape and thus \u006e is translated to n, while the first \ and second \ are preserved as raw input characters. When the three characters \ \ n are subsequently interpreted in a string literal, \ \ is the escape sequence for a backslash, so the resulting string contains the two characters \ n, which spell the escape sequence for a linefeed. (Note that \\\u006e may be written as \u005c\u005c\u006e because each Unicode escape \u005c is translated to a raw input character \ and so the remaining input stream \u006e is preceded by an even number of \ characters and processed as the Unicode escape for n.)
      -----
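
      The string literal examples in the note can be checked directly with a compiler. The following is a minimal sketch; the class and method names are hypothetical, and the comments avoid spelling out Unicode escapes because escapes are also translated inside comments.

```java
// Hypothetical demo of the string literals discussed in the note above.
final class StringLiteralEscapeDemo {
    private StringLiteralEscapeDemo() {}

    // Two adjacent backslashes: the second is not eligible to begin a Unicode
    // escape, so the string holds six characters (backslash, u, 0, 0, 5, a).
    static String unicodeEscapeText() {
        return "\\u005a";
    }

    // Three adjacent backslashes: the third begins a Unicode escape for n, and
    // the first two form the escape sequence for a backslash, so the string
    // holds two characters (backslash, n).
    static String escapeSequenceText() {
        return "\\\u006e";
    }

    // The same two characters, with each backslash written as a Unicode escape.
    static String escapeSequenceTextViaEscapes() {
        return "\u005c\u005c\u006e";
    }

    public static void main(String[] args) {
        System.out.println(unicodeEscapeText().length());   // prints 6
        System.out.println(escapeSequenceText().length());  // prints 2
        // prints true
        System.out.println(escapeSequenceText().equals(escapeSequenceTextViaEscapes()));
    }
}
```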

      In other words, a pair of \ characters in the input stream helps to denote a Unicode escape in a string literal (\\u005a --> \ \ u 0 0 5 a --> \ u 0 0 5 a) while a triple of \ characters helps to denote an escape sequence (\\\u005a --> \ \ Z --> \ Z).

              People

              Assignee:
              abuckley Alex Buckley
              Reporter:
              abuckley Alex Buckley