JDK-8258119

Linebreak pattern needs adjustment to conform to Unicode TR18 and PCRE

    Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 15
    • Fix Version/s: tbd
    • Component/s: core-libs
    • Labels:
      None

      Description

      The fix for JDK-8235812 changed how the Unicode linebreak pattern, \R, is matched. That change will be backed out by JDK-8258259.

      The problem stated in JDK-8235812 was that the pattern \R{2} did not match the string "\r\n", and the fix changed the behavior so that the match succeeded. This *seemed* the correct thing to do, as the Pattern class spec has a definition for \R which is essentially

      -----
      \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
      -----

      and the behavior after the change conforms to that definition.
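
      As a minimal sketch of that conformance, the spec definition can be written out by hand and quantified with {2}. The class and method names below are hypothetical, and the expansion is spelled out explicitly instead of using \R itself, because what \R does depends on the JDK build in use.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the JDK.
final class SpecLinebreakSketch {
    // Hand-written expansion of the Pattern spec's linebreak definition.
    // Backslashes are doubled so the regex engine, not the compiler, reads the escapes.
    static final String SPEC_LINEBREAK =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the whole input is exactly two linebreaks under that definition.
    static boolean isTwoLinebreaks(String input) {
        return Pattern.compile(SPEC_LINEBREAK + "{2}").matcher(input).matches();
    }

    public static void main(String[] args) {
        // The engine can backtrack and split CR LF into CR followed by LF.
        System.out.println(isTwoLinebreaks("\r\n"));     // true
        System.out.println(isTwoLinebreaks("\r\n\r\n")); // true
        System.out.println(isTwoLinebreaks("\n"));       // false
    }
}
```

      Under this expansion nothing stops the character class from matching a lone \r that is followed by \n, which is why the quantified pattern accepts "\r\n".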

      The problem is that this definition of the \R pattern doesn't match the recommendation from TR18, which is

      -----
      (?:\u000D\u000A)|(?!\u000D\u000A)[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
      -----

      (Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)

      The salient difference is the negative lookahead "(?!\u000D\u000A)", which prevents the character class from matching a \r that is immediately followed by \n. Thus, under the TR18 recommendation the pattern \R{2} would NOT match the string "\r\n". Indeed, PCRE has this behavior.
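
      The effect of the lookahead can be checked with a small sketch that quantifies the TR18-style expression directly. The class and method names are hypothetical, and the expression is the transliteration given above, not what any particular JDK build does for \R.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the JDK.
final class Tr18LinebreakSketch {
    // TR18-style linebreak: CR LF as a unit, or a single linebreak character
    // that is not the start of a CR LF pair (enforced by the negative lookahead).
    static final String TR18_LINEBREAK =
        "(?:(?:\\u000D\\u000A)|(?!\\u000D\\u000A)"
        + "[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the whole input is exactly two linebreaks under that definition.
    static boolean isTwoLinebreaks(String input) {
        return Pattern.compile(TR18_LINEBREAK + "{2}").matcher(input).matches();
    }

    public static void main(String[] args) {
        // CR LF is one linebreak here; the lookahead blocks splitting it in two.
        System.out.println(isTwoLinebreaks("\r\n"));     // false
        System.out.println(isTwoLinebreaks("\r\n\r\n")); // true
        System.out.println(isTwoLinebreaks("\r\r"));     // true
    }
}
```

      A lone \r still counts as a linebreak when it is not followed by \n, so "\r\r" is accepted while "\r\n" is not.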

      The Pattern spec's definition of \R should be revisited to see whether it should be adjusted to match TR18 more closely. The test cases removed in the backout changeset JDK-8258259 should be revisited, as should the code changes. It seems odd that the implementation of \R doesn't simply expand to something more or less equivalent to the TR18 expression. There may be special-case code that handles \R directly instead of treating it as a "macro" that expands to a more complicated sequence. It's not clear which approach is preferable.

              People

              Assignee:
              smarks Stuart Marks
              Reporter:
              smarks Stuart Marks