Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-6284152

Lookbehinds with internal quantifiers are unreliable

    Details

    • Subcomponent:
    • Resolved In Build:
      b81
    • CPU:
      generic, x86
    • OS:
      generic, windows_xp

      Description

      FULL PRODUCT VERSION :


      A DESCRIPTION OF THE PROBLEM :
      A lookbehind makes an assertion about the text immediately preceding the
      current match position (that is, the position that has been reached in the
      target text at the time the lookbehind gets evaulated, which I'll call the
      CMP henceforth). That means the text matched by the lookbehind subexpression
      should end precisely at the CMP. If the subexpression contains quantifiers,
      it should employ backtracking as needed to achieve alignment with the CMP,
      just like it would for '$', '\b', or any other anchor. But the way it's
      implemented, the subexpression match is allowed to end wherever it happens
      to, then that position is compared to the CMP. That comparison is what
      determines whether the lookbehind succeeds or fails, even if backtracking
      could have changed the result.

      In the sample code, I'm trying to use lookbehind to find all matches for the
      regex 'foo\d', but only if they're preceded by a percent sign and up to five
      intervening characters (so it should match "foo1", "foo2" and "foo3"). The
      obvious regex for this is '(?<=%.{0,5})foo\d', but it fails to match "foo1"
      and "foo2". The problem is that '.{0,5}' greedily matches the maximum five
      characters, taking it past the CMP. Then, because the end position of that
      match is not the same as the CMP, the lookbehind fails. The third attempt
      succeeds only because there are exactly five characters between the "%" and
      "foo3".

      In the second regex, '(?<=%.{0,5}?)foo\d', I use a reluctant quantifier, but
      that doesn't work any better. The first attempt succeeds because there are
      zero intervening characters in that line, but on the second and third tries,
      the matcher never goes back to try matching more characters.

      In the third regex, '(?<=%.{0,5}\b)foo\d', I try using my knowledge of the
      data to help anchor the lookbehind. The first attempt still fails because
      the word-boundary assertion matches at the end of the line, but on the second
      attempt the '.{0,5}' is forced to backtrack to the correct position. This
      partial success demonstrates how lookbehinds are supposed to work: they need
      to be solidly anchored to the current match position, and the anchor has to
      be integral to the lookbehind, not applied afterward as is currently done.

      (The fourth regex demonstrates that negative lookbehinds are equally broken:
      it should match only the last two lines, but it matches "foo1" and "foo2" as
      well. The fifth regex is a workaround that I discuss below.)



      REPRODUCIBILITY :
      This bug can be reproduced always.

      ---------- BEGIN SOURCE ----------
      import java.util.regex.*;

      public class Test
      {
        public static void main(String[] args)
        {
          String str = "%foo1\n" +
                       "%bar foo2\n" +
                       "%bar foo3\n" +
                       "%blahblah foo4\n" +
                       "foo5";

          String[] rgxs = { "(?<=%.{0,5})foo\\d",
                            "(?<=%.{0,5}?)foo\\d",
                            "(?<=%.{0,5}\\b)foo\\d",
                            "(?<!%.{0,5})foo\\d",
                            "foo\\d(?<=%.{0,5}foo\\d)" };

          for (int i = 0; i < rgxs.length; i++)
          {
            Pattern p = Pattern.compile(rgxs[i]);
            Matcher m = p.matcher(str);
            System.out.println();
            System.out.println(p.pattern());
            while (m.find())
            {
              System.out.println(m.group());
            }
          }
        }
      }

      ---------- END SOURCE ----------

      CUSTOMER SUBMITTED WORKAROUND :
      Match the part you want to capture first, then use a lookbehind to check for
      the prefix, as I did in my fifth sample regex: 'foo\d(?<=%.{0,5}foo\d)'. This
      workaround seems pretty reliable, but its usefulness is very limited because
      the whole regex has to have an obvious maximum length. That means it has to
      be fairly simple and can't contain any '*', '+' or '{n,}' quantifiers.
      ###@###.### 2005-06-10 19:27:30 GMT

        Attachments

          Activity

            People

            • Assignee:
              sherman Xueming Shen
              Reporter:
              gmanwanisunw Girish Manwani (Inactive)
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Imported:
                Indexed: