  1. JDK
  2. JDK-8059325

The documentation of regex $ is still wrong

    Details

    • Type: Bug
    • Status: Open
    • Priority: P4
    • Resolution: Unresolved
    • Affects Version/s: 8u20
    • Fix Version/s: tbd
    • Component/s: core-libs
    • Labels:

      Description

      A DESCRIPTION OF THE PROBLEM :
      The "Line terminators" section of the java.util.regex.Pattern class documentation says:

      >By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.

      Further down in the description of the MULTILINE mode constant, it says:

      >In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.

      We know experimentally that this is not true, because when not in multiline mode $ also matches just before a line terminator at the end of the input sequence:

          System.out.println(java.util.regex.Pattern.compile("x$").matcher("x\n").find()); // true, even though $ is not at the end of the input sequence
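
      For comparison, here is a minimal sketch that runs the same kind of check against the \Z and \z boundary matchers, which the Pattern documentation describes as "The end of the input but for the final terminator, if any" and "The end of the input" respectively. The class name DollarVersusEndAnchors and the helper found() are made up for illustration. The point is only that, outside multiline mode, $ appears to behave like \Z rather than like \z.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the JDK.
class DollarVersusEndAnchors {
    // True if the regex (compiled with no flags) is found anywhere in input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"x", "x\n", "x\r\n", "x\n\n", "x\ny"};
        for (String input : inputs) {
            // Make the line terminators visible in the printed table.
            String shown = input.replace("\r", "\\r").replace("\n", "\\n");
            System.out.printf("%-8s  x$ -> %-5b  x\\Z -> %-5b  x\\z -> %b%n",
                    shown,
                    found("x$", input),
                    found("x\\Z", input),
                    found("x\\z", input));
        }
    }
}
```

      On the inputs above, x$ and x\Z should agree on every row, while x\z differs as soon as a trailing line terminator is present.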

      I'll repeat what the documentation says again since I know from experience that it takes umpteen reports and exhaustive repetition to get anything done around here. The documentation says that "By default" (that is, when not in multiline mode, which we are not), "these expressions" (that is, ^ and $) "only match" (**ONLY**, meaning there are no other weird cases where they could possibly match) "at the beginning and the end of the entire input sequence" (^ only matches at the beginning of the entire input sequence; $ only matches at the end of the entire input sequence).

      Extracting the relevant information, we get the following *incorrect* description of $:

      >When not in multiline mode, $ matches only at the end of the entire input sequence.

      When I last reported this issue, the reviewer concocted an incorrect interpretation of the word "ignore" in the first of the two paragraphs of the documentation, believing that to "ignore" line terminators means "to detect them in order to actively make allowances for their presence in the input sequence in order to match what comes next". However, the introductory "by default" in the documentation indicates that what follows is the distinction between what happens when not in multiline mode compared to what happens in multiline mode. It explains that "^ and $ ignore line terminators" **AS OPPOSED TO MULTILINE MODE WHERE IT IS EXPLAINED HOW THEY DO NOT IGNORE LINE TERMINATORS**. That sentence stands in contrast to the two that follow, not alone, so it does **NOT** mean that "^ and $ ignore line terminators" "by making allowances for their presence in the input sequence".

      Admittedly, yes, it is possible for a fool to misinterpret the language so that it matches the actual behavior. However, that does not stand up to further scrutiny -- if ^ and $ "ignored" line terminators according to the misinterpretation of "ignore", then we should also expect the following to output 'true' because ^ is also supposed to "ignore" line terminators:

          System.out.println(java.util.regex.Pattern.compile("^x").matcher("\nx").find()); // oh but actually it's false
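
      The asymmetry is easy to check side by side. A minimal sketch, with the class name CaretAsymmetry and the helper find() invented for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the JDK.
class CaretAsymmetry {
    // True if the regex, compiled with the given flags, is found in input.
    public static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Default mode: ^ does not skip over a leading line terminator ...
        System.out.println(find("^x", 0, "\nx"));                  // false
        // ... yet $ does accept a trailing one.
        System.out.println(find("x$", 0, "x\n"));                  // true
        // Only MULTILINE lets ^ match just after the terminator.
        System.out.println(find("^x", Pattern.MULTILINE, "\nx"));  // true
    }
}
```

      If "ignore" really meant "make allowances for", the first and second lines would have to print the same value.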

      The second of the two paragraphs of the documentation uses similar language except that the sentence order is reversed (so "by default" comes second) and the word "ignore" is not used at all. There the error is not debatable or complicated. It is blatant.

      Here is the correct description of ^ and $. This is what the documentation SHOULD SAY:

      In single-line mode (when the MULTILINE mode constant is not used):
          ^ matches at the beginning of the input sequence.
          $ matches at the end of the input sequence or just before a line terminator at the end of the input sequence.
      When multiline mode is activated, ^ and $ adopt different behavior:
          ^ matches at the beginning of the input sequence or just after any line terminator -- *unless* it is at the end of the input sequence, where it does not match at all. (Unlike in single-line mode, ^ cannot match the empty string.)
          $ matches at the end of the input sequence or just before any line terminator.
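
      The description above can be checked by listing every offset at which a bare ^ or $ matches. A minimal sketch, where AnchorPositions and positions() are hypothetical names chosen for illustration and the input "a\nb\n" is just a convenient sample with a line terminator both in the middle and at the end:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the JDK.
class AnchorPositions {
    // Returns the start offset of every (empty) match of regex in input.
    public static List<Integer> positions(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Offsets: 'a' = 0, '\n' = 1, 'b' = 2, '\n' = 3, end of input = 4
        String input = "a\nb\n";
        System.out.println("default   ^ " + positions("^", 0, input));                 // [0]
        System.out.println("default   $ " + positions("$", 0, input));                 // [3, 4]
        System.out.println("multiline ^ " + positions("^", Pattern.MULTILINE, input)); // [0, 2]
        System.out.println("multiline $ " + positions("$", Pattern.MULTILINE, input)); // [1, 3, 4]
    }
}
```

      Note that in default mode $ matches at two offsets (3 and 4), which is exactly what "only at the end of the entire input sequence" rules out, and that in multiline mode ^ does not match at offset 4 even though it follows a line terminator.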

      This is something like the fourth time I've reported this and I'm sick of it. It is really not complicated, so kindly make the minimal effort it takes to get it right this time. If you still feel like arguing because you still don't understand the problem, then find someone who does understand it.

      P.S. It is absurd that the bug form requires selection of a specific operating system to report a documentation error.


      URL OF FAULTY DOCUMENTATION :
      http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

              People

              Assignee:
              sherman Xueming Shen
              Reporter:
              webbuggrp Webbug Group
              Votes:
              0
              Watchers:
              3
