  1. JDK
  2. JDK-8264160

Regex \b is not consistent with \w without UNICODE_CHARACTER_CLASS


    Details

      Description

      ADDITIONAL SYSTEM INFORMATION :
      Windows 10, although that is probably irrelevant.

      Java 17 ea, but also reproducible on Java 8

      > java -version
      openjdk version "17-ea" 2021-09-14
      OpenJDK Runtime Environment (build 17-ea+14-1110)
      OpenJDK 64-Bit Server VM (build 17-ea+14-1110, mixed mode, sharing)

      openjdk version "1.8.0_222"
      OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
      OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)

      The figures below are the ones obtained on JDK 17. Due to updates in the Unicode database, results differ between JDK versions, but the inconsistencies are always there.

      A DESCRIPTION OF THE PROBLEM :
      As already highlighted by https://bugs.openjdk.java.net/browse/JDK-6452709 and later by https://bugs.openjdk.java.net/browse/JDK-8043727, the Javadoc is too vague about the meaning of \b and \B in regular expressions. However, as the latter issue points out, \b is usually understood to be consistent with \w and \W.

      This is the case in Java regexes when the UNICODE_CHARACTER_CLASS flag is used, but \b is not consistent with \w when the flag is not used. The set of characters that \b treats as word characters also differs with and without UNICODE_CHARACTER_CLASS, so this is not a case of \b always using the Unicode definitions.
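
      The comparison described above can be probed for a single character. The following is a minimal sketch, not part of the attached reproduction: the class name WordBoundaryProbe, its helper methods, and the choice of U+00E9 as the sample character are assumptions made for illustration. It reuses the reproduction's technique of prefixing a ';' so that \b can only match if it treats the following character as a word character. The result of the default-mode \b line is deliberately not stated here, since it is the behavior under discussion and depends on the JDK version.

```java
import java.util.regex.Pattern;

public class WordBoundaryProbe {
    // True if the single-character string s is matched by \w under the given flags.
    static boolean wordCharPerW(String s, int flags) {
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // ';' is never a word character, so ";\\b." matches ";" + s only when
    // \b treats the character in s as a word character.
    static boolean wordCharPerB(String s, int flags) {
        return Pattern.compile(";\\b.", flags | Pattern.DOTALL)
                      .matcher(";" + s)
                      .matches();
    }

    public static void main(String[] args) {
        String sample = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, an arbitrary non-ASCII letter
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        System.out.println("\\w default: " + wordCharPerW(sample, 0));
        System.out.println("\\b default: " + wordCharPerB(sample, 0));
        System.out.println("\\w unicode: " + wordCharPerW(sample, unicode));
        System.out.println("\\b unicode: " + wordCharPerB(sample, unicode));
    }
}
```

      If the first two lines disagree, the sample character is one of the code points counted by check 1 of the reproduction.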

      The inconsistency makes \b practically unusable without UNICODE_CHARACTER_CLASS, because its behavior follows no intuitive or broadly accepted definition and is not documented. Therefore, I am submitting this as a bug report rather than as a documentation issue like the ones above.

      A workaround is to use the subpattern `(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))` instead.
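
      One caveat when applying the workaround: because it uses positive lookarounds on both sides, it requires a character on each side of the position, so it cannot match at the very start or end of the input (the reproduction avoids this by prefixing ';'). The sketch below compares it with a variant built from negative lookarounds that also covers the input edges. The variant, the class name AsciiWordBoundaryDemo, and the helper countMatches are illustrative assumptions, not part of the original report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiWordBoundaryDemo {
    // Workaround exactly as given in the report.
    static final String REPORTED = "(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))";

    // Hypothetical variant: negative lookarounds also succeed where there is
    // no neighboring character, i.e. at the start and end of the input.
    static final String WITH_EDGES = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Counts the (zero-length) matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ab cd";
        // Boundaries inside the input only: after "ab" and before "cd".
        System.out.println(countMatches(REPORTED, input));   // prints 2
        // Additionally matches at index 0 and at the end of the input.
        System.out.println(countMatches(WITH_EDGES, input)); // prints 4
    }
}
```

      Both forms rely only on \w and \W, so without UNICODE_CHARACTER_CLASS they follow the ASCII-only definition of a word character.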

      The attached reproduction highlights the inconsistencies. My expectation is that \b (and \B) should be consistent with \w and \W, in all cases.

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Using the test file test/Test.java provided below:

      $ javac -d bin test/Test.java
      $ java -cp bin test.Test

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      1. total: 0

      2. total: 0

      3. total: 0

      ...
      4. total: ??? (many)
      ACTUAL -
      ...
      1. 31347 false true
      1. 31348 false true
      1. 31349 false true
      1. 3134a false true
      1. total: 131829

      2. total: 0

      3. total: 0

      4. 300 false true
      4. 301 false true
      4. 302 false true
      ...
      4. total: 2672

      ---------- BEGIN SOURCE ----------
      package test;

      import java.util.regex.*;

      public class Test {
        private static Pattern basicWordCharPattern = Pattern.compile("\\w");
        private static Pattern basicWordCharForBoundaryPattern = Pattern.compile(";\\b.", Pattern.DOTALL);

        private static Pattern basicWordCharForBoundaryWithWorkaroundPattern = Pattern.compile(";(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)).", Pattern.DOTALL);

        private static Pattern unicodeWordCharPattern = Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);
        private static Pattern unicodeWordCharForBoundaryPattern = Pattern.compile(";\\b.", Pattern.UNICODE_CHARACTER_CLASS | Pattern.DOTALL);

        private static String cpToString(int cp) {
          if (Character.isBmpCodePoint(cp))
            return "" + ((char) cp);
          else
            return "" + Character.highSurrogate(cp) + Character.lowSurrogate(cp);
        }

        private static boolean isBasicWordChar(int cp) {
          return basicWordCharPattern.matcher(cpToString(cp)).matches();
        }

        private static boolean isBasicWordCharForBoundary(int cp) {
          return basicWordCharForBoundaryPattern.matcher(";" + cpToString(cp)).matches();
        }

        private static boolean isBasicWordCharForBoundaryWithWorkaround(int cp) {
          return basicWordCharForBoundaryWithWorkaroundPattern.matcher(";" + cpToString(cp)).matches();
        }

        private static boolean isUnicodeWordChar(int cp) {
          return unicodeWordCharPattern.matcher(cpToString(cp)).matches();
        }

        private static boolean isUnicodeWordCharForBoundary(int cp) {
          return unicodeWordCharForBoundaryPattern.matcher(";" + cpToString(cp)).matches();
        }

        public static void main(String[] args) {
          // Print code points for which \b is not consistent with \w without UNICODE_CHARACTER_CLASS.
          int total = 0;
          for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean basicWC = isBasicWordChar(cp);
            boolean basicBoundaryWC = isBasicWordCharForBoundary(cp);

            if (basicWC != basicBoundaryWC) {
              System.out.println("1. " + Integer.toHexString(cp) + " " + basicWC + " " + basicBoundaryWC);
              total++;
            }
          }
          System.out.println("1. total: " + total); // 131829, but should be 0

          System.out.println("");

          // Print code points for which the workaround is not consistent with \w without UNICODE_CHARACTER_CLASS.
          total = 0;
          for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean basicWC = isBasicWordChar(cp);
            boolean basicBoundaryWithWorkaroundWC = isBasicWordCharForBoundaryWithWorkaround(cp);

            if (basicWC != basicBoundaryWithWorkaroundWC) {
              System.out.println("2. " + Integer.toHexString(cp) + " " + basicWC + " " + basicBoundaryWithWorkaroundWC);
              total++;
            }
          }
          System.out.println("2. total: " + total); // 0

          System.out.println("");

          // Print code points for which \b is not consistent with \w *with* UNICODE_CHARACTER_CLASS.
          total = 0;
          for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean unicodeWC = isUnicodeWordChar(cp);
            boolean unicodeBoundaryWC = isUnicodeWordCharForBoundary(cp);

            if (unicodeWC != unicodeBoundaryWC) {
              System.out.println("3. " + Integer.toHexString(cp) + " " + unicodeWC + " " + unicodeBoundaryWC);
              total++;
            }
          }
          System.out.println("3. total: " + total); // 0 (correct; they are all consistent)

          System.out.println("");

          /* Print code points for which \b without UNICODE_CHARACTER_CLASS is inconsistent
           * with \b *with* UNICODE_CHARACTER_CLASS.
           */
          total = 0;
          for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean basicBoundaryWC = isBasicWordCharForBoundary(cp);
            boolean unicodeBoundaryWC = isUnicodeWordCharForBoundary(cp);

            if (basicBoundaryWC != unicodeBoundaryWC) {
              System.out.println("4. " + Integer.toHexString(cp) + " " + basicBoundaryWC + " " + unicodeBoundaryWC);
              total++;
            }
          }
          System.out.println("4. total: " + total); // 2672 (should be much higher)
        }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      A workaround is to use the subpattern `(?:(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w))` instead.

      FREQUENCY : always


              People

              Assignee:
              igraves Ian Graves
              Reporter:
              webbuggrp Webbug Group
              Votes:
              0
              Watchers:
              5
