JDK-8216332

Grapheme regex does not work with emoji sequences

    Details

      Description

      ADDITIONAL SYSTEM INFORMATION :
      openjdk version "12-ea" 2019-03-19
      OpenJDK Runtime Environment (build 12-ea+26)
      OpenJDK 64-Bit Server VM (build 12-ea+26, mixed mode, sharing)

      A DESCRIPTION OF THE PROBLEM :
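      Emoji sequences such as 👨🏾 (an emoji followed by a skin-tone modifier) or 👨‍👩‍👦 (emoji joined by zero-width joiners) are not kept together as single clusters by the regular expression boundary matcher \b{g} (a Unicode extended grapheme cluster boundary). The matcher reports boundaries inside each sequence.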
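
      For contrast, the following is a minimal sketch of how \b{g} behaves on a sequence it is expected to handle: a base letter followed by a combining mark should stay in one cluster. The class name GraphemeBoundaryDemo and the helper clusters are illustrative and not part of the report. The sketch assumes Java 9 or later, where \b{g} was introduced.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative class name; not part of the original report.
class GraphemeBoundaryDemo {
    // \b{g} matches at Unicode extended grapheme cluster boundaries (Java 9+).
    private static final Pattern BOUNDARY = Pattern.compile("\\b{g}");

    // Splits the input at every grapheme cluster boundary the matcher reports.
    static List<String> clusters(String input) {
        return BOUNDARY.splitAsStream(input).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x'.
        List<String> parts = clusters("e\u0301x");
        System.out.println(parts.size() + " cluster(s) reported");
    }
}
```

      The emoji sequences in this report rely on different boundary rules (emoji modifiers and zero-width joiners), which is where the matcher on the affected build splits too eagerly.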

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
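      import java.util.function.Function;
      import java.util.regex.Pattern;
      import java.util.stream.Collectors;

      // Build the test string from code points: 👨🏾 followed by 👨‍👩‍👦.
      String stringmoji = new StringBuilder()
          .appendCodePoint(0x1f468).appendCodePoint(0x1f3fe)  // man + skin-tone modifier
          .appendCodePoint(0x1f468).appendCodePoint(0x200d)   // man, zero-width joiner
          .appendCodePoint(0x1f469).appendCodePoint(0x200d)   // woman, zero-width joiner
          .appendCodePoint(0x1f466)                           // boy
          .toString();
      Pattern pattern = Pattern.compile("\\b{g}");
      // Render a string as a comma-separated list of its code points in hex.
      Function<String, String> toCodePointNumber = (cp) -> cp.codePoints()
          .mapToObj(c -> String.format("%04x", c))
          .collect(Collectors.joining(","));
      // Split at each reported grapheme boundary and print every piece in brackets.
      System.out.println(pattern.splitAsStream(stringmoji)
          .map(toCodePointNumber)
          .collect(Collectors.joining("][", "[", "]")));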

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      [1f468,1f3fe][1f468,200d,1f469,200d,1f466]
      ACTUAL -
      [1f468][1f3fe][1f468,200d][1f469,200d][1f466]
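
      The expected grouping above can be reproduced independently of the regex engine with a deliberately simplified sketch. It is not the JDK's algorithm and not a full implementation of the Unicode segmentation rules: it only attaches emoji modifiers (U+1F3FB to U+1F3FF) and zero-width joiners to the preceding code point, and attaches whatever follows a zero-width joiner to the same cluster. The real rule for joiners is narrower than that, so this over-joins on other inputs. The class and method names (EmojiClusterSketch, clusters, describe) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; a simplified illustration, not a UAX #29 implementation.
class EmojiClusterSketch {
    private static final int ZWJ = 0x200D;

    // Emoji skin-tone modifiers occupy U+1F3FB..U+1F3FF.
    static boolean isEmojiModifier(int cp) {
        return cp >= 0x1F3FB && cp <= 0x1F3FF;
    }

    // Groups code points, keeping modifiers and ZWJ-joined code points together.
    static List<List<Integer>> clusters(String input) {
        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int previous = -1;
        for (int cp : input.codePoints().toArray()) {
            // Simplifying assumption: anything after a ZWJ joins the cluster.
            boolean glue = !current.isEmpty()
                    && (isEmojiModifier(cp) || cp == ZWJ || previous == ZWJ);
            if (!glue && !current.isEmpty()) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(cp);
            previous = cp;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    // Formats clusters the same way as the report: [cp,cp][cp,...].
    static String describe(String input) {
        return clusters(input).stream()
                .map(c -> c.stream()
                        .map(cp -> String.format("%04x", cp))
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("][", "[", "]"));
    }

    public static void main(String[] args) {
        String stringmoji = new StringBuilder()
                .appendCodePoint(0x1f468).appendCodePoint(0x1f3fe)
                .appendCodePoint(0x1f468).appendCodePoint(0x200d)
                .appendCodePoint(0x1f469).appendCodePoint(0x200d)
                .appendCodePoint(0x1f466)
                .toString();
        // prints [1f468,1f3fe][1f468,200d,1f469,200d,1f466]
        System.out.println(describe(stringmoji));
    }
}
```

      Comparing this output with the result of splitting on \b{g} gives a quick check of whether a given runtime shows the behavior described here.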

      FREQUENCY : always


            People

            • Assignee: Lance Andersen (lancea)
            • Reporter: Webbug Group (webbuggrp)