JDK-8269290

UnicodeReader not translating \u005c\\u005d to \\]

    Details

    • Type: CSR
    • Status: Closed
    • Priority: P2
    • Resolution: Approved
    • Fix Version/s: 17
    • Component/s: tools
    • Labels:
      None
    • Subcomponent:
    • Compatibility Kind:
      source
    • Compatibility Risk:
      low
    • Compatibility Risk Description:
      This is a rarely used idiom that only showed up because of corpus work at Google.
    • Interface Kind:
      Java API, Language construct
    • Scope:
      Implementation

      Description

      Summary

      A bug introduced in the JDK 16 javac compiler changed two aspects of how the Unicode escape \u005c is interpreted: 1) its role as an escaping backslash, and 2) its lack of effect on subsequent Unicode escapes.

      Problem

      This issue relates to Unicode escapes, described in section 3.3 of the JLS. javac interprets Unicode escapes during the reading of ASCII characters from source. Later on, javac interprets escape sequences, described in section 3.10.7 of the JLS, during the tokenization of character literals, string literals, and text blocks. Escape sequences are only indirectly affected by this bug.

      During reading, a normal backslash (that is, the ASCII \ character, not the corresponding Unicode escape \u005c) followed by another normal backslash is treated collectively as a pair of backslash characters. No further interpretation is done. This means that if a normal backslash immediately precedes the sequence \ u A B C D, which would "normally" be interpreted as a Unicode escape, then the interpretation of that sequence as a Unicode escape is suppressed.

      For example, the sequence \u2022 would be interpreted as the character •, whereas \\u2022 would be interpreted as the seven characters \ \ u 2 0 2 2.
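
      The suppression can be observed from ordinary Java source. The following is a minimal sketch; the class and field names are hypothetical. Note that the second literal reaches the tokenizer as the seven characters shown above, and the string-literal escape sequence \\ then collapses to a single backslash, so the run-time string has six characters.

```java
final class EscapeSuppressionDemo {
    // Read as a Unicode escape: the literal holds the single character U+2022.
    static final String BULLET = "\u2022";

    // The leading pair of normal backslashes suppresses the Unicode escape.
    // The tokenizer later turns the pair into one backslash (escape sequence).
    static final String LITERAL = "\\u2022";

    public static void main(String[] args) {
        System.out.println(BULLET.length());
        System.out.println(LITERAL.length());
        System.out.println(LITERAL.charAt(0) == '\\');
    }
}
```

      This case is not affected by the bug, so the lengths are expected to be 1 and 6 regardless of the javac version.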

      An issue arises when Java developers use a Unicode escape backslash \u005c in their source code instead of a normal backslash. Prior to JDK 16, if the Unicode escape backslash was followed by a second Unicode escape, then the second Unicode escape was always interpreted: the normal backslash at the beginning of the second Unicode escape (immediately followed by u) was not paired with the preceding Unicode escape backslash. Otherwise, any following normal backslash was paired with the \u005c.

      For example, the sequence \u005c\u2022 would be interpreted as \ and •, whereas \u005c\tXYZ would be interpreted as \ \ t X Y Z.
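
      The second case can be checked with a string literal. This is a sketch with hypothetical names. The first case cannot be shown the same way, because a backslash followed by • is not a valid escape sequence inside a literal.

```java
final class EscapedBackslashPairDemo {
    // The escape for U+005C pairs with the normal backslash that follows it,
    // so the tokenizer sees an escaped backslash followed by t X Y Z.
    static final String VIA_ESCAPE = "\u005c\tXYZ";

    // The same content written with two normal backslashes.
    static final String VIA_PLAIN = "\\tXYZ";

    public static void main(String[] args) {
        System.out.println(VIA_ESCAPE.equals(VIA_PLAIN));
        System.out.println(VIA_ESCAPE.length());
    }
}
```

      Both literals should denote the five characters \ t X Y Z (a backslash and a letter t, not a tab).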

      With the bug, JDK 16 treated \u005c as having no effect on the interpretation of subsequent Unicode escapes. Using the example from the compiler-dev discussions, \u005c\\u005d:

      • Prior to JDK 16, it was interpreted as \ \ ]
      • JDK 16 interpreted it as \ \ \ u 0 0 5 d which would produce a syntax error downstream in the lexer because the escape sequence \u is invalid.
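
      A minimal source file exercising this exact sequence might look as follows (the class and field names are hypothetical). It assumes a javac without the bug; on an affected JDK 16 compiler the literal is expected to be rejected, as described above.

```java
final class EscapedBackslashBugDemo {
    // Source characters between the quotes: the escape for U+005C, one normal
    // backslash, then the escape for U+005D.
    static final String S = "\u005c\\u005d";

    public static void main(String[] args) {
        System.out.println(S.length());
        System.out.println(S.equals("\\]"));
    }
}
```

      With the pre-JDK 16 interpretation the tokenizer sees \ \ ], so the string should contain the two characters \ and ].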

      Solution

      The proposed fix is to reintroduce the pre-JDK 16 behavior of \u005c\.

      Specification

          diff --git a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
          index c51be0fdf07..b089cf396cc 100644
          --- a/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
          +++ b/src/jdk.compiler/share/classes/com/sun/tools/javac/parser/UnicodeReader.java
          @@ -85,6 +85,11 @@ public class UnicodeReader {
                */
               private boolean wasBackslash;
      
          +    /**
          +     * true if the last character was derived from an unicode escape sequence.
          +     */
          +    private boolean wasUnicodeEscape;
          +
               /**
                * Log for error reporting.
                */
          @@ -105,6 +110,7 @@ public class UnicodeReader {
                   this.character = '\0';
                   this.codepoint = 0;
                   this.wasBackslash = false;
          +        this.wasUnicodeEscape = false;
                   this.log = sf.log;
      
                   nextCodePoint();
          @@ -161,17 +167,22 @@ public class UnicodeReader {
                   // Fetch next character.
                   nextCodeUnit();
      
          -        // If second backslash is detected.
          -        if (wasBackslash) {
          -            // Treat like a normal character (not part of unicode escape.)
          -            wasBackslash = false;
          -        } else if (character == '\\') {
          -            // May be an unicode escape.
          +        if (character == '\\' && (!wasBackslash || wasUnicodeEscape)) {
          +            // Is a backslash and may be an unicode escape.
                       switch (unicodeEscape()) {
          -                case BACKSLASH -> wasBackslash = true;
          -                case VALID_ESCAPE -> wasBackslash = false;
          +                case BACKSLASH -> {
          +                    wasUnicodeEscape = false;
          +                    wasBackslash = !wasBackslash;
          +                }
          +                case VALID_ESCAPE -> {
          +                    wasUnicodeEscape = true;
          +                    wasBackslash = character == '\\' && !wasBackslash;
          +                }
                           case BROKEN_ESCAPE -> nextUnicodeInputCharacter(); //skip broken unicode escapes
                       }
          +        } else {
          +            wasBackslash = false;
          +            wasUnicodeEscape = false;
                   }
      
                   // Codepoint and character match if not surrogate.
          @@ -297,6 +308,7 @@ public class UnicodeReader {
                   position = pos;
                   width = 0;
                   wasBackslash = false;
          +        wasUnicodeEscape = false;
                   nextCodePoint();
               }
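
      The two-flag logic in the patch can be tried out in isolation. The sketch below is a simplified, hypothetical model and not javac's UnicodeReader: it works on a String, ignores surrogates and positions, and throws on a malformed escape where javac would report an error and skip it. The jdk16 parameter is an assumption of the sketch that selects the removed branch logic from the diff.

```java
// Simplified, hypothetical model of the Unicode-escape translation step.
final class UnicodeEscapeModel {

    /**
     * If a well-formed escape (backslash, one or more 'u', four hex digits)
     * starts at index i, returns the index of its first hex digit.
     * Returns -1 when the backslash is not followed by 'u'.
     * Throws on a malformed escape (an assumption of this model).
     */
    private static int hexStart(String s, int i) {
        int j = i + 1;
        while (j < s.length() && s.charAt(j) == 'u') {
            j++;
        }
        if (j == i + 1) {
            return -1;
        }
        if (j + 4 > s.length()) {
            throw new IllegalArgumentException("broken escape at " + i);
        }
        for (int k = j; k < j + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                throw new IllegalArgumentException("broken escape at " + i);
            }
        }
        return j;
    }

    /** Translates src; jdk16 = true models the logic removed by the patch. */
    static String translate(String src, boolean jdk16) {
        StringBuilder out = new StringBuilder();
        boolean wasBackslash = false;
        boolean wasUnicodeEscape = false;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean mayBeEscape = c == '\\'
                    && (!wasBackslash || (!jdk16 && wasUnicodeEscape));
            if (!mayBeEscape) {
                // Ordinary character, or the second half of a backslash pair.
                wasBackslash = false;
                wasUnicodeEscape = false;
                out.append(c);
                i++;
                continue;
            }
            int hex = hexStart(src, i);
            if (hex < 0) {
                // Plain backslash: opens a pair, or closes one opened by an escape.
                wasUnicodeEscape = false;
                wasBackslash = !wasBackslash;
                out.append('\\');
                i++;
            } else {
                char decoded = (char) Integer.parseInt(src.substring(hex, hex + 4), 16);
                // Patched logic: an escaped backslash can itself open a pair.
                wasBackslash = !jdk16 && decoded == '\\' && !wasBackslash;
                wasUnicodeEscape = true;
                out.append(decoded);
                i = hex + 4;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input characters: the escape for U+005C, one plain backslash,
        // then the escape for U+005D (the example from the Problem section).
        String input = "\\u005c\\\\u005d";
        System.out.println("patched: " + translate(input, false));
        System.out.println("JDK 16 : " + translate(input, true));
    }
}
```

      Under this model the patched logic yields \ \ ] for the example input, while the JDK 16 logic yields \ \ \ u 0 0 5 d, matching the two interpretations listed in the Problem section.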

              People

              Assignee:
              jlaskey Jim Laskey
              Reporter:
              jlaskey Jim Laskey
              Reviewed By:
              Jan Lahoda