JDK-7080302

The normalization in Java regex Pattern may have a flaw

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P4
    • Resolution: Fixed
    • Affects Version/s: 6u26
    • Fix Version/s: 9
    • Component/s: core-libs
    • Labels:

      Description

      FULL PRODUCT VERSION :
      $java -version

      java version "1.6.0_26"
      Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
      Java HotSpot(TM) Client VM (build 20.1-b02, mixed mode, sharing)


      ADDITIONAL OS VERSION INFORMATION :
      $uname -a

      Linux lc_rh_8 2.4.27 #3 SMP Fri Oct 31 16:51:51 GMT 2008 i686 i686 i386 GNU/Linux

      A DESCRIPTION OF THE PROBLEM :
      This problem occurs only when Unicode canonical-equivalence matching is enabled (Pattern.CANON_EQ).

      It seems that normalization of the pattern string is not handled correctly. When I add a capturing group (a pair of parentheses) around a single Unicode sequence (one base character followed by two NON_SPACING_MARK characters), the right parenthesis is apparently treated as a NON_SPACING_MARK character as well.
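To make the character classes involved concrete, here is a minimal sketch that uses only java.text.Normalizer and java.lang.Character. It shows the canonical decomposition of the sequence inside the group and confirms that ')' is not a non-spacing mark. The class and method names (CanonEqNotes, nfd, codePoints, isNonSpacingMark) are hypothetical helpers chosen for illustration; this does not reproduce what Pattern does internally.

```java
import java.text.Normalizer;

class CanonEqNotes {
    // Canonical decomposition (NFD) of the input string.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Render each code point as U+XXXX so combining marks are visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // True if the code point has general category Mn (NON_SPACING_MARK).
    static boolean isNonSpacingMark(int cp) {
        return Character.getType(cp) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        String group = "\u0041\u0301\u0328"; // contents of the capturing group
        // Canonical ordering puts the ogonek (class 202) before the acute (class 230).
        System.out.println(codePoints(nfd(group)));   // U+0041 U+0328 U+0301
        System.out.println(isNonSpacingMark(0x0301)); // true
        System.out.println(isNonSpacingMark(0x0328)); // true
        System.out.println(isNonSpacingMark(')'));    // false
    }
}
```

Since ')' has general category END_PUNCTUATION, a correct normalization step should leave it outside the combining sequence.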


      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Use the source code I pasted below:

      javac Test.java
      java Test abcd

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      It should print "false".

      ACTUAL -
      An exception is thrown:

      Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 14
      a((?:A??)|??)|?)?|A?)?|?)?|??)|A??)|Á?)|??)|?)?|Á)?|A?)?|Á)?|Á?)|??)|?)?|A)??|A)??)
                    ^
              at java.util.regex.Pattern.error(Pattern.java:1713)
              at java.util.regex.Pattern.compile(Pattern.java:1464)
              at java.util.regex.Pattern.<init>(Pattern.java:1133)
              at java.util.regex.Pattern.compile(Pattern.java:847)
              at Test.<init>(Test.java:8)
              at Test.main(Test.java:19)


      REPRODUCIBILITY :
      This bug can always be reproduced.

      ---------- BEGIN SOURCE ----------
      import java.util.regex.*;

      public class Test {
          private Pattern pattern;

          public Test() {
              String patternString = "a(\u0041\u0301\u0328)"; // capture group 1
              pattern = Pattern.compile(patternString, Pattern.CANON_EQ); // unicode canonical equivalent match
          }

          boolean match(String s) {
              Matcher m = pattern.matcher(s);
              return m.find();
          }


          public static void main(String[] argv) {
              if (argv.length > 0) {
                  boolean matched = new Test().match(argv[0]);
                  System.out.println(matched);
              }
          }
      }

      ---------- END SOURCE ----------
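As a companion to the reproducer, the following sketch checks whether a given runtime accepts the same pattern with and without Pattern.CANON_EQ, by catching PatternSyntaxException instead of letting it propagate. The class name CanonEqProbe and the helper compiles are hypothetical. The outcome of the CANON_EQ line depends on the JDK in use, so no expected output is claimed for it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CanonEqProbe {
    // Returns true if the regex compiles with the given flags,
    // false if Pattern.compile rejects it with a PatternSyntaxException.
    static boolean compiles(String regex, int flags) {
        try {
            Pattern.compile(regex, flags);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String regex = "a(\u0041\u0301\u0328)"; // same pattern as in the report
        System.out.println("plain:    " + compiles(regex, 0));
        System.out.println("CANON_EQ: " + compiles(regex, Pattern.CANON_EQ));
    }
}
```

On an affected release such as the 6u26 build in this report, the second line would be expected to print false, matching the exception shown above.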

              People

              • Assignee:
                sherman Xueming Shen
                Reporter:
                webbuggrp Webbug Group