JDK-8264547

RegEx pattern matching loses character class after intersection (&&) operator

    Details

    • Type: CSR
    • Status: Closed
    • Priority: P4
    • Resolution: Approved
    • Fix Version/s: 17
    • Component/s: core-libs
    • Labels:
      None
    • Subcomponent:
    • Compatibility Kind:
      behavioral
    • Compatibility Risk:
      low
    • Compatibility Risk Description:
      This proposal changes a longstanding behavior in the regex matcher. For example, patterns of the shape `nested&&[nested]unnested` currently do not match anything. A pattern of the shape `[a-z]&&[a-g]h-z` would now match the entire range of characters, because the matcher would properly reflect the full intersection.
    • Interface Kind:
      Java API
    • Scope:
      SE

      Description

      Summary

      Regular expression pattern matching loses a character class after the intersection (&&) operator. This change fixes a bug in the regex compiler so that it no longer drops certain character classes when compiling the intersection (&&) operator. The buggy behavior is long-standing: it has existed since at least JDK 7, and likely earlier.

      Problem

      When character classes both inside and outside of square brackets ([..]) are mixed on the right-hand side of an intersection operator (&&), the compiler drops some of them from the matchers it produces. The resulting matchers are broken because they are missing important character classes, and they will stay broken until the compiler is fixed. This behavior is publicly documented and known (see the second paragraph in the "Intersection of Multiple Classes" subsection).
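      The effect can be observed with a small probe. The following is a minimal sketch, not part of the CSR: the class name `IntersectionProbe` and the helper `matchedLetters` are hypothetical, and the second pattern assumes that the shape `[a-z]&&[a-g]h-z` quoted in the compatibility notes is meant to appear inside an enclosing pair of square brackets. It lists which lowercase letters a character class accepts on the running JDK.

```java
import java.util.regex.Pattern;

class IntersectionProbe {
    // Returns every letter in 'a'..'z' that the given single-character regex matches.
    static String matchedLetters(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Plain intersection with a single bracketed right-hand side.
        System.out.println(matchedLetters("[a-z&&[a-g]]"));
        // Bracketed class followed by an unbracketed range on the right-hand side.
        System.out.println(matchedLetters("[[a-z]&&[a-g]h-z]"));
    }
}
```

      On a JDK that includes this change (fix version 17), the description above implies that the last line should list every letter from a to z. Older releases are expected to print fewer letters, because the bracketed `[a-g]` is dropped.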

      Solution

      The solution is to fix a bug where the regex compiler clobbers the matchers it constructs for the right-hand side of the intersection operation when it should be merging them with union operators. This brings the behavior in line with that of Ruby's regular expressions. Python's bundled re library doesn't support intersection. Perl and JavaScript do not support nested expressions inside square brackets the way Java and Ruby do.
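      The difference between clobbering and merging can be modeled outside of the regex compiler. The sketch below is an illustration only and is not the code in Pattern.java: `RhsMergeSketch`, `Operand`, `buildRight`, and `matches` are hypothetical names, and `IntPredicate` stands in for the compiler's internal predicate type. It mirrors the shape `[a-z]&&[a-g]h-z`, where the right-hand side has a bracketed operand followed by an unbracketed one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

class RhsMergeSketch {
    // One operand on the right-hand side of &&, either [bracketed] or unbracketed.
    static final class Operand {
        final IntPredicate pred;
        final boolean bracketed;

        Operand(IntPredicate pred, boolean bracketed) {
            this.pred = pred;
            this.bracketed = bracketed;
        }
    }

    static IntPredicate range(char lo, char hi) {
        return c -> lo <= c && c <= hi;
    }

    // fixed == false models the old behavior: an unbracketed operand overwrites
    // whatever was accumulated. fixed == true models the union-based merge.
    static IntPredicate buildRight(List<Operand> operands, boolean fixed) {
        IntPredicate right = null;
        for (Operand op : operands) {
            if (op.bracketed || fixed) {
                right = (right == null) ? op.pred : right.or(op.pred);
            } else {
                right = op.pred; // clobbers earlier operands
            }
        }
        return right;
    }

    // Evaluates the model of [a-z] && ([a-g] followed by h-z) for one character.
    static boolean matches(boolean fixed, char c) {
        IntPredicate left = range('a', 'z');
        List<Operand> rhs = Arrays.asList(
                new Operand(range('a', 'g'), true),
                new Operand(range('h', 'z'), false));
        return left.and(buildRight(rhs, fixed)).test(c);
    }

    public static void main(String[] args) {
        System.out.println("old model, 'c': " + matches(false, 'c')); // false
        System.out.println("new model, 'c': " + matches(true, 'c'));  // true
    }
}
```

      In the old model the range `a-g` is lost as soon as `h-z` is read, so only letters in `h-z` survive the intersection. With the union-based merge, the right-hand side covers `a-z` and the intersection keeps the whole range.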

      Specification

      --- a/src/java.base/share/classes/java/util/regex/Pattern.java
      +++ b/src/java.base/share/classes/java/util/regex/Pattern.java
      @@ -2663,7 +2663,11 @@ loop:   for(int x=0, offset=0; x<nCodePoints; x++, offset+=len) {
                                           right = right.union(clazz(true));
                                   } else { // abc&&def
                                       unread();
      -                                right = clazz(false);
      +                                if(right == null) {
      +                                    right = clazz(false);
      +                                } else {
      +                                    right = right.union(clazz(false));
      +                                }
                                   }
                                   ch = peek();
                               }

              People

              Assignee:
              igraves Ian Graves
              Reporter:
              webbuggrp Webbug Group
              Reviewed By:
              Roger Riggs