JDK-8275184

change in regex character class operator precedence

    Details

    • Type: CSR
    • Status: Closed
    • Priority: P3
    • Resolution: Approved
    • Fix Version/s: 9
    • Component/s: core-libs
    • Labels:
      None
    • Subcomponent:
    • Compatibility Kind:
      behavioral
    • Compatibility Risk:
      medium
    • Compatibility Risk Description:
      This was a silent behavior change that resulted in several bug reports. People had to change their regex patterns to deal with the new behavior when they upgraded from JDK 8 to a later release.
    • Interface Kind:
      Java API
    • Scope:
      SE

      Description

      Summary

      This is a retroactive CSR for JDK-6609854, which changed the behavior of regex pattern operators on character classes. That behavior change could be regarded as incompatible, but it was never properly CSRed or documented. This CSR is a first attempt at a fairly rigorous description of the change. Subsequent work may update the appropriate specifications.

      Problem

      Character classes are a feature of regex patterns. There are several set arithmetic operations possible on character classes. The operators are documented in the Pattern class specification, and they are described by some simple examples shown there. However, the behaviors of combinations of operators are not specified. The behavior in JDK 8 and earlier was well-defined and predictable, but it was complex, counterintuitive, and hard to explain.

      This change (integrated in JDK 9) makes the behavior more sensible. However, it also broke some existing uses of regex patterns, and several bug reports were filed as users stumbled over the difference. (See links from the main bug.) Given that this change was integrated in JDK 9 and has remained in place through JDK 11 -- an LTS release -- it is too late to revert it. Instead, it is better to leave the change in place and improve the documentation of the new behavior.

      Solution

      BACKGROUND

      Character classes occur within square brackets. Nesting of character classes is also possible by nesting sets of brackets. The operators on character classes are as follows:

      Range: -

      Constructs a character class consisting of a range between two literal characters. For example, [a-z] is a character class consisting of lowercase characters in the range from a to z, inclusive.

      Negation: ^

      Immediately after the opening square bracket of a character class, negates (complements) the character class. For example, [^a-z] is a character class consisting of any character other than lowercase characters in the range a to z.

      Union: (empty)

      Results in the union of nested character classes, if they are adjacent to one another. For example, [[a-f][d-h]] is equivalent to [a-h]. A union also occurs between an outer character class and a nested character class. For example, [a-m[n-z]] is equivalent to the union of [a-m] and [n-z] which in turn is equivalent to [a-z].

      Note that several literal characters and character ranges at the same level of a character class together form the definition of that one character class; they are not a union of multiple classes. This is true even if the characters or ranges at the same level are separated by an intervening nested class. For example, [a-d[e-g]h-j] is equivalent to the union of the top-level class [a-dh-j] with the nested class [e-g]. It is not equivalent to the union of three character classes [a-d], [e-g], and [h-j]. This distinction is significant only in the JDK 8 behavior, where the union operator has a lower precedence than the negation operator.
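
      As a rough illustration of this note, the sketch below checks that the un-negated class [a-d[e-g]h-j] accepts exactly the letters a through j, and separately evaluates the negated form, which is where the grouping matters. The class name GroupingNote and its methods are made up for this example. Following the rule described above, the negated form would be expected to match "e" on JDK 8 but not on JDK 9 or later, so that result is returned rather than hard-coded.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class GroupingNote {

    // True if the un-negated class accepts exactly 'a'..'j'
    // among the lowercase ASCII letters.
    static boolean unnegatedMatchesAToJ() {
        Pattern p = Pattern.compile("[a-d[e-g]h-j]");
        for (char c = 'a'; c <= 'z'; c++) {
            boolean expected = c <= 'j';
            if (p.matcher(String.valueOf(c)).matches() != expected) {
                return false;
            }
        }
        return true;
    }

    // The result of this check depends on the JDK release in use.
    static boolean negatedMatchesE() {
        return Pattern.matches("[^a-d[e-g]h-j]", "e");
    }

    public static void main(String[] args) {
        System.out.println("un-negated class matches a..j only: " + unnegatedMatchesAToJ());
        System.out.println("[^a-d[e-g]h-j] matches \"e\": " + negatedMatchesE());
    }
}
```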

      Intersection: &&

      Results in a character class that is the intersection of two character classes. For example, [a-h&&d-k] is equivalent to [d-h].

      (The examples above show the character class operators using literal characters and character ranges for the sake of simplicity. The set algebra operators are more useful when combined with the various predefined character classes.)
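
      The operators described above can be tried out with a few lines of Java. This is a minimal sketch; the class name CharClassBasics and the helper method matches are made up for the example, and the last pattern, which combines predefined classes, is an added illustration rather than one taken from this CSR. None of these patterns mixes negation with union or intersection at the same bracket level, so the results should not depend on the JDK release.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CharClassBasics {

    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Range
        System.out.println(matches("[a-z]", "m"));          // true
        // Negation
        System.out.println(matches("[^a-z]", "m"));         // false
        System.out.println(matches("[^a-z]", "M"));         // true
        // Union of adjacent nested classes: same as [a-h]
        System.out.println(matches("[[a-f][d-h]]", "g"));   // true
        System.out.println(matches("[[a-f][d-h]]", "i"));   // false
        // Union of an outer class with a nested class: same as [a-z]
        System.out.println(matches("[a-m[n-z]]", "q"));     // true
        // Intersection: same as [d-h]
        System.out.println(matches("[a-h&&d-k]", "e"));     // true
        System.out.println(matches("[a-h&&d-k]", "b"));     // false
        // Predefined classes: word characters that are not digits
        System.out.println(matches("[\\w&&[^\\d]]", "a"));  // true
        System.out.println(matches("[\\w&&[^\\d]]", "5"));  // false
    }
}
```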

      PRECEDENCE CHANGES

      The range operator - constructs a character class from character literals, not from other character classes, so syntactically it has the highest precedence. This remains unchanged. The precedence among the negation, union, and intersection operators was changed.

      In JDK 8, the operator precedence was as follows, from highest to lowest:

      1. range -
      2. negation ^
      3. union [a][b]
      4. intersection &&

      In JDK 9 and later, the operator precedence was changed to be as follows, from highest to lowest:

      1. range -
      2. union [a][b]
      3. intersection &&
      4. negation ^

      The net effect is that the precedence of the negation operator was moved from a very high precedence to the lowest precedence. Although this is an incompatible change, it actually makes a good deal of sense. For example, given any character class [...], adding a negation operator [^...] now always negates the entire character class. This was not true in JDK 8, whose actual behavior was quite difficult to understand.
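
      Code that must run on both sides of this change could, as one option, probe the runtime to see which precedence rules are in effect. The following is a sketch only; the class name NegationPrecedenceProbe and its method names are made up for the example, and the probe patterns are the ones used in the examples in this CSR. Under the JDK 8 rules the first probe is expected to return true and the second false, and the reverse under the JDK 9 rules.

```java
import java.util.regex.Pattern;

// Hypothetical probe class, for illustration only.
final class NegationPrecedenceProbe {

    // Expected to be true when negation binds more tightly than union
    // (the JDK 8 rules): [^a[b]c] then reads as [^ac] union [b].
    static boolean negationBindsTighterThanUnion() {
        return Pattern.matches("[^a[b]c]", "b");
    }

    // Expected to be true when negation applies to the whole class
    // (the JDK 9 rules): [^a&&b] then negates the empty intersection.
    static boolean negationAppliesToWholeClass() {
        return Pattern.matches("[^a&&b]", "a");
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("negation binds tighter than union: "
                + negationBindsTighterThanUnion());
        System.out.println("negation applies to whole class:   "
                + negationAppliesToWholeClass());
    }
}
```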

      EXAMPLES


      Pattern.compile("[^a[b]c]").matcher("b").matches()

      JDK 8: true. The negation is performed on the outer character class, which effectively is [^ac]. This is then unioned with [b].

      JDK 9: false. The union is performed before the negation, resulting in a character class that is equivalent to [^abc].


      Pattern.compile("[^a&&b]").matcher("a").matches()

      JDK 8: false. The negation is performed first, effectively giving [^a], which is then intersected with [b].

      JDK 9: true. The intersection a&&b is performed first, resulting in the empty set, which is then negated, giving a character class that matches everything.


      Pattern.compile("[a[b]&&b[c]]").matcher("a").matches()

      All JDKs: false. (The behavior here hasn't changed; it is shown here for completeness of the discussion of operator precedence.) The union operations are performed first. Thus we have [ab] intersected with [bc] which results in [b].
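
      If a pattern has to behave the same way on JDK 8 and on later releases, one option is to spell out the intended grouping with explicit nested brackets, so that a negation never shares a bracket level with a union or an intersection. The rewrites below are a sketch under that assumption, not an officially recommended form; the class name PortableCharClasses and its constants are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical holder class, for illustration only.
final class PortableCharClasses {

    // JDK 8 reading of [^a[b]c]: the class [^ac] unioned with b.
    static final Pattern JDK8_STYLE_UNION = Pattern.compile("[[^ac]b]");

    // JDK 9 reading of [^a[b]c]: the negation of [abc].
    static final Pattern JDK9_STYLE_UNION = Pattern.compile("[^abc]");

    // JDK 8 reading of [^a&&b]: [^a] intersected with [b].
    static final Pattern JDK8_STYLE_INTERSECTION = Pattern.compile("[[^a]&&[b]]");

    // Returns true if the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(JDK8_STYLE_UNION, "b"));
        System.out.println(matches(JDK9_STYLE_UNION, "b"));
        System.out.println(matches(JDK8_STYLE_INTERSECTION, "a"));
    }
}
```

      Making the grouping explicit keeps the meaning of a pattern independent of the negation operator's precedence, at the cost of somewhat longer patterns.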


      Specification

      The precedence of character class operators has never been specified, and this change did not update the Pattern specification at all. Eventually, the Pattern specification should be updated to be more precise in its treatment of the character class operators. This work is covered by JDK-8264671.

              People

              Assignee:
              smarks Stuart Marks
              Reporter:
              sherman Xueming Shen
              Reviewed By:
              Mark Reinhold