  1. JDK
  2. JDK-8046101

JEP 111: Additional Unicode Constructs for Regular Expressions

    Details

    • Type: JEP
    • Status: Candidate
    • Priority: P4
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: core-libs
    • Labels:
      None
    • Author:
      Xueming Shen
    • JEP Type:
      Feature
    • Exposure:
      Open
    • Scope:
      SE
    • Discussion:
      core dash libs dash dev at openjdk dot java dot net
    • Effort:
      S
    • Duration:
      S
    • JEP Number:
      111

      Description

      Summary

      Adopt further regular-expression constructs from Unicode TR#18.

      Motivation

      The primary motivation is to enrich Unicode support so that developers can write sophisticated Unicode-aware regular expressions on the Java platform. This matters for keeping the Java Platform competitive with other languages that already offer more complete support for Unicode regular expressions.

      Description

      Java regular expressions are derived from Perl regular expressions and are intended to give Java developers most of the Perl-style regular-expression features. Perl regular expressions have evolved rapidly over the past couple of years to follow Unicode TR#18, Unicode Regular Expressions. Java regular expressions claim conformance with Level 1 of TR#18, plus RL2.1 Canonical Equivalents; Level 1 is the "lowest" level of conformance. Given that the Unicode Standard has been widely accepted as the de facto standard for development platforms, and that Java uses Unicode as its internal encoding scheme, higher-level Unicode support is desirable for developers working on Unicode-aware applications. The following new constructs and features are proposed to provide better Unicode support in Java regular expressions:

      • \N{...} -- Unicode Name Properties
      • \X -- Extended Grapheme Clusters
      • Fix the broken Canonical Equivalent support
      • \R -- Unicode line-break sequence, as suggested in TR#18 Line Boundaries
      • \g{...} -- Perl-style construct for named and numbered capturing groups
      • More complete Unicode properties, as in \p{IsXXXX}
      • \h \H \v \V -- Horizontal/vertical whitespace
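
To make the list above concrete, here is a minimal sketch that exercises several of these constructs through `java.util.regex.Pattern`. It assumes a runtime on which `\R`, `\h`, `\X` and `\N{...}` are accepted (JDK 9 or later, to my knowledge); this JEP itself only proposes them. The class and method names are hypothetical and chosen for illustration only. The last helper uses the `\k<name>` back-reference syntax that `Pattern` already supports, not the proposed `\g{...}` form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of this JEP.
final class UnicodeRegexSketch {

    // \R: split on any Unicode line-break sequence (CRLF counts as one break).
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    // \X: collect each extended grapheme cluster as one element.
    static List<String> graphemes(String text) {
        List<String> clusters = new ArrayList<>();
        Matcher m = Pattern.compile("\\X").matcher(text);
        while (m.find()) {
            clusters.add(m.group());
        }
        return clusters;
    }

    // \h: collapse each run of horizontal whitespace into a single space.
    static String collapseHorizontal(String text) {
        return text.replaceAll("\\h+", " ");
    }

    // \N{...}: match a single character by its Unicode name.
    static boolean isEuroSign(String s) {
        return s.matches("\\N{EURO SIGN}");
    }

    // \p{IsXXXX}: match a single character by a binary Unicode property.
    static boolean isAlphabetic(String s) {
        return s.matches("\\p{IsAlphabetic}");
    }

    // Existing named back-reference syntax (\k<name>), shown for comparison
    // with the proposed Perl-style \g{...} construct.
    static boolean hasDoubledWord(String text) {
        return Pattern.compile("\\b(?<word>\\w+)\\h+\\k<word>\\b")
                      .matcher(text)
                      .find();
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\nc\rd").length);
        System.out.println(graphemes("e\u0301x").size());
        System.out.println(collapseHorizontal("a \t\u00A0b"));
        System.out.println(isEuroSign("\u20AC"));
        System.out.println(isAlphabetic("\u00E9"));
        System.out.println(hasDoubledWord("the  the cat"));
    }
}
```

On a runtime without one of these constructs, `Pattern.compile` would be expected to reject the pattern with a `PatternSyntaxException`, so a caller that must run on older releases may want to guard compilation accordingly.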

      Testing

      All of the features (new regex constructs) listed here will be covered by new unit tests that run in the existing test framework.

              People

              • Assignee:
                sherman Xueming Shen
                Reporter:
                sherman Xueming Shen
                Owner:
                Xueming Shen
                Endorsed By:
                Brian Goetz