JDK-8247546

Pattern matching does not skip correctly over supplementary characters

    Details

    • Subcomponent:
    • Resolved In Build:
      b09
    • CPU:
      x86_64
    • OS:
      linux, windows_10
    • Verification:
      Verified

      Description

      A DESCRIPTION OF THE PROBLEM :
      The find method in java.util.regex.Matcher incorrectly skips only the first char of a supplementary code point when searching for an initial pattern match. The problematic code is in the java.util.regex.Pattern.Start node, which contains the following code:

                  for (; i ... [rest of the loop lost in extraction]

      SYSTEM / OS / JAVA RUNTIME INFORMATION :
      Tested on OpenJDK 14.0.1 and 11.0.5

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      See the attached source code. The goal of the program is to replace invalid (lone) surrogate characters; properly encoded supplementary characters, like the example emoji, should be left unchanged.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      The input string containing the emoji should not be matched or replaced by the pattern.
      ACTUAL -
      The pattern does not match at char index 0, but the search then steps forward by only one char (instead of one code point) and matches the second half of the supplementary code point, i.e. the low surrogate. That char is then replaced. Output (the question mark is due to terminal encoding):

      ? d83d
      X 58
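
      The difference between the two stepping strategies can be sketched without involving the regex engine. The snippet below is only an illustration and not the JDK's code: charSteps and codePointSteps are hypothetical helper names. They list the start indices a search loop would try when advancing by one char versus by one code point.

```java
import java.util.ArrayList;
import java.util.List;

public class SteppingSketch {
    // Hypothetical helper: candidate start indices when advancing one char at a time.
    static List<Integer> charSteps(CharSequence seq) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < seq.length(); i++) {
            starts.add(i);
        }
        return starts;
    }

    // Hypothetical helper: candidate start indices when advancing one code point at a time.
    static List<Integer> codePointSteps(CharSequence seq) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < seq.length(); ) {
            starts.add(i);
            // charCount is 2 for a supplementary code point and 1 otherwise.
            i += Character.charCount(Character.codePointAt(seq, i));
        }
        return starts;
    }

    public static void main(String[] args) {
        // The emoji from the report (two chars) followed by a plain 'a'.
        String s = new StringBuilder().appendCodePoint(0x1F4A9).append('a').toString();
        System.out.println(charSteps(s));      // [0, 1, 2]
        System.out.println(codePointSteps(s)); // [0, 2]
    }
}
```

      Index 1 is the low surrogate of the emoji. A loop that steps by char offers it as a match start, which is consistent with the behavior reported above; a loop that steps by code point never does.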

      ---------- BEGIN SOURCE ----------
      import java.util.regex.Pattern;

      public class ReplaceInvalidSurrogates {
          public static void main(String[] args) {
              String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
              System.out.println(pileofpoo);

              // Match the high and low surrogate ranges. This should only match lone surrogates, not correctly encoded supplementary characters.
              Pattern surrogates = Pattern.compile("[\\x{D800}-\\x{DBFF}\\x{DC00}-\\x{DFFF}]");

              String result = surrogates.matcher(pileofpoo).replaceAll("X");

              System.out.println(result);
              System.out.println(result.charAt(0) + " " + Integer.toHexString(result.charAt(0)));
              System.out.println(result.charAt(1) + " " + Integer.toHexString(result.charAt(1)));
          }
      }

      ---------- END SOURCE ----------
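
      On affected JDK builds, the goal of the reproducer can also be reached without the regex engine. The following is a minimal sketch under that assumption, not part of the original report: replaceLoneSurrogates is a hypothetical helper that walks the string, copies well-formed surrogate pairs unchanged, and replaces only unpaired surrogates.

```java
public class LoneSurrogateSketch {
    // Hypothetical helper: replaces every unpaired surrogate char with the given replacement.
    static String replaceLoneSurrogates(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // Well-formed pair: copy both chars and skip past the low surrogate.
                out.append(c).append(input.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // High surrogate without a following low surrogate, or a stray low surrogate.
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        System.out.println(replaceLoneSurrogates(emoji, 'X').equals(emoji)); // true
        System.out.println(replaceLoneSurrogates("a\uD83Db", 'X'));          // aXb
    }
}
```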

      FREQUENCY : always


            People

            Assignee:
            naoto Naoto Sato
            Reporter:
            webbuggrp Webbug Group