Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8241055

Regex Grapheme Matcher Performance Depends too much on Total Input Sequence Size

    Details

    • Subcomponent:
    • Introduced In Build:
      b24
    • Introduced In Version:
      12
    • Resolved In Build:
      b20
    • Verification:
      Not verified

      Description

      ADDITIONAL SYSTEM INFORMATION :
      System/OS/runtime-independent

      A DESCRIPTION OF THE PROBLEM :
      the more advanced the current position in a string to be matched with grapheme regexes \X the longer it takes to find the next match. For longer strings performance quickly becomes unbearable.

      The algorithm is overly complex (O(number of characters in input string) instead of O(1) per match) because Grapheme.nextBoundary always starts at the beginning of the whole input string in order to correctly identify pairs of regional indicator symbols.

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Find all matches of \X in a not too short string and measure the time.
      Find all matches of \X in a n times longer string than the first one and again measure the time.

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      n times longer string results in n times longer processing time.
      ACTUAL -
      takes longer than that. much longer

      ---------- BEGIN SOURCE ----------
      /*
       * Copyright (c) 2020, Oracle and/or its affiliates. All rights reserved.
       * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
       *
       * This code is free software; you can redistribute it and/or modify it
       * under the terms of the GNU General Public License version 2 only, as
       * published by the Free Software Foundation.
       *
       * This code is distributed in the hope that it will be useful, but WITHOUT
       * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
       * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
       * version 2 for more details (a copy is included in the LICENSE file that
       * accompanied this code).
       *
       * You should have received a copy of the GNU General Public License version
       * 2 along with this work; if not, write to the Free Software Foundation,
       * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
       *
       * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
       * or visit www.oracle.com if you need additional information or have any
       * questions.
       */

      import static java.lang.Math.sqrt;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      import org.testng.annotations.*;
      import static org.testng.Assert.*;

      /**
       * @test
       * @bug 6202130
       * @library ..
       * @run testng/manual GraphemeTimePerCharTest
       * @summary Measures {@link java.util.regex.Grapheme#nextBoundary} performance
       * depending on input character sequence size.
       */
      public class GraphemeTimePerCharTest {

          static final int STEPS = 10;
          static final int LIMIT = 1000000; // maximum string size tested

          static final Pattern CHARACTER_REGEX = Pattern.compile("\\X");

          void matchPerformance(String value) {
              Matcher charMatcher = CHARACTER_REGEX.matcher(value);
              while (charMatcher.find()) charMatcher.group();
          }

          @Test
          public void test() throws Exception {
              matchPerformance("warmup");

              double[][] strTimesByNChars = new double[STEPS][];
              double[][] chrTimesByNChars = new double[STEPS][];
              System.out.println("number of chars, time string [s], time per char [ns]");
              for (int i = 0; i < STEPS; i++) {
                  int chunkSize = (i + 1) * LIMIT / STEPS;
                  String value = "x".repeat(chunkSize);
                  long start = System.nanoTime();
                  matchPerformance(value);
                  long end = System.nanoTime();
                  long t = end - start;
                  strTimesByNChars[i] = new double[] { chunkSize, t };
                  double timePerChar = (double) t / (double) chunkSize;
                  chrTimesByNChars[i] = new double[] { chunkSize, timePerChar };
                  System.out.printf("%7d %6d %10.3f%n", chunkSize, (t / 1000000), timePerChar);
              }
              double cStr = correlate(strTimesByNChars);
              double cChr = correlate(chrTimesByNChars);
              System.out.printf("correlation factor between string size and total time (expected 1): %10.3f%n", cStr);
              System.out.printf("correlation factor between string size and time per character (expected 0): %10.3f%n", cChr);
              assertTrue(cStr > .5 && cChr < .5);
          }

          double correlate(double[][] data) {
              double sumx = 0, sumy = 0;
              for (int i = 0; i < data.length; i++) {
                  sumx += data[i][0];
                  sumy += data[i][1];
              }
              double avgx = sumx / data.length;
              double avgy = sumy / data.length;
              double sxy = 0, sxx = 0, syy = 0;
              for (int i = 0; i < data.length; i++) {
                  sxy += (data[i][0] - avgx) * (data[i][1] - avgy);
                  sxx += (data[i][0] - avgx) * (data[i][0] - avgx);
                  syy += (data[i][1] - avgy) * (data[i][1] - avgy);
              }
              return sxy / sqrt(sxx * syy);
          }

      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      No generally applicable workaround is known to me. For matching all \X after one another through all of a string, BreakIterator might do.

      FREQUENCY : always


        Attachments

          Activity

            People

            • Assignee:
              naoto Naoto Sato
              Reporter:
              webbuggrp Webbug Group
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: