JDK-8273296

Character.getName doesn't follow Unicode spec for ideographs

    Details

    • Type: CSR
    • Status: Closed
    • Priority: P4
    • Resolution: Approved
    • Fix Version/s: 18
    • Component/s: core-libs
    • Labels:
      None
    • Subcomponent:
    • Compatibility Risk:
      minimal
    • Compatibility Risk Description:
      This is a doc-only change.
    • Interface Kind:
      Java API
    • Scope:
      SE

      Description

      Summary

      Clarify the spec of j.l.Character#getName(int) and j.l.Character#codePointOf(String) in terms of Unicode Standard conformance.

      Problem

      These methods use a JDK-specific scheme to derive and parse names for characters that are not explicitly named in the UnicodeData.txt file. That scheme deviates from the one defined in the "Unicode Name Property" section of the Unicode Standard.

      Solution

      State the deviation explicitly in the method descriptions. The bug submitter suggested changing the scheme to align with Unicode, but that is not possible because it would introduce a compatibility issue: names generated by prior JDKs could no longer be passed to codePointOf(String).
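
To make the deviation concrete, here is a minimal sketch that builds both names for one ideograph. `jdkFallbackName` reproduces the expression quoted in the Javadoc; `unicodeIdeographName` is a hypothetical helper that assumes its argument lies in a CJK Unified Ideographs range and applies the Unicode Standard's derived-name rule (fixed prefix, hyphen, uppercase hex). Neither helper is part of the JDK.

```java
import java.util.Locale;

class NameSchemes {
    // The fallback expression quoted in the getName(int) Javadoc.
    // Note: UnicodeBlock.of returns null for code points outside any block,
    // so this sketch only makes sense for code points that belong to a block.
    static String jdkFallbackName(int codePoint) {
        return Character.UnicodeBlock.of(codePoint).toString().replace('_', ' ')
                + " "
                + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
    }

    // Hypothetical helper: the Unicode-style derived name for a CJK unified
    // ideograph. Assumes the caller passes a code point from such a range.
    static String unicodeIdeographName(int codePoint) {
        return "CJK UNIFIED IDEOGRAPH-"
                + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int cp = 0x4E00;
        System.out.println(jdkFallbackName(cp));      // CJK UNIFIED IDEOGRAPHS 4E00
        System.out.println(unicodeIdeographName(cp)); // CJK UNIFIED IDEOGRAPH-4E00
    }
}
```

The two strings differ in the plural block name and in the separator before the hex digits, which is why a name produced under one scheme is not interchangeable with the other.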

      Specification

      Change the method descriptions of j.l.Character#getName(int) and j.l.Character#codePointOf(String) as follows:

      getName(int):

           /**
      -     * Returns the Unicode name of the specified character
      +     * Returns the name of the specified character
            * {@code codePoint}, or null if the code point is
            * {@link #UNASSIGNED unassigned}.
            * <p>
      -     * Note: if the specified character is not assigned a name by
      +     * If the specified character is not assigned a name by
            * the <i>UnicodeData</i> file (part of the Unicode Character
            * Database maintained by the Unicode Consortium), the returned
      -     * name is the same as the result of expression:
      +     * name is the same as the result of the expression:
            *
            * <blockquote>{@code
      @@ -11310,13 +11310,17 @@
            *     + " "
            *     + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
            *
            * }</blockquote>
            *
      +     * For the {@code codePoint}s in the <i>UnicodeData</i> file, the name
      +     * returned by this method follows the naming scheme in the
      +     * "Unicode Name Property" section of the Unicode Standard. For other
      +     * code points, such as Hangul/Ideographs, The name generation rule above
      +     * differs from the one defined in the Unicode Standard.
      +     *
            * @param  codePoint the character (Unicode code point)
            *
      -     * @return the Unicode name of the specified character, or null if
      +     * @return the name of the specified character, or null if
            *         the code point is unassigned.
            *
            * @throws IllegalArgumentException if the specified
            *            {@code codePoint} is not a valid Unicode
            *            code point.
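
The three outcomes described in the getName(int) text above (a name, null for an unassigned code point, or an exception for an invalid one) can be exercised with a short sketch. The `describe` helper and its placeholder strings are illustrative assumptions, not JDK API.

```java
class GetNameDemo {
    // Hypothetical helper that maps each documented outcome to a string.
    static String describe(int codePoint) {
        try {
            String name = Character.getName(codePoint);
            // null means the code point is unassigned
            return name == null ? "<unassigned>" : name;
        } catch (IllegalArgumentException e) {
            // thrown when codePoint is not a valid Unicode code point
            return "<invalid code point>";
        }
    }

    public static void main(String[] args) {
        // Named explicitly in UnicodeData: follows the Unicode name.
        System.out.println(describe('A'));
        // Ideograph: falls back to the block-name-plus-hex expression.
        System.out.println(describe(0x4E00));
        // Beyond U+10FFFF: not a valid code point.
        System.out.println(describe(0x110000));
    }
}
```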

      codePointOf(String):

           /**
            * Returns the code point value of the Unicode character specified by
      -     * the given Unicode character name.
      +     * the given character name.
            * <p>
      -     * Note: if a character is not assigned a name by the <i>UnicodeData</i>
      +     * If a character is not assigned a name by the <i>UnicodeData</i>
            * file (part of the Unicode Character Database maintained by the Unicode
      -     * Consortium), its name is defined as the result of expression:
      +     * Consortium), its name is defined as the result of the expression:
            *
            * <blockquote>{@code
            *     Character.UnicodeBlock.of(codePoint).toString().replace('_', ' ')
      @@ -11357,16 +11361,20 @@
            * }</blockquote>
            * <p>
            * The {@code name} matching is case insensitive, with any leading and
            * trailing whitespace character removed.
            *
      -     * @param  name the Unicode character name
      +     * For the code points in the <i>UnicodeData</i> file, this method
      +     * recognizes the name which conforms to the name defined in the
      +     * "Unicode Name Property" section in the Unicode Standard. For other
      +     * code points, this method recognizes the name generated with
      +     * {@link #getName(int)} method.
      +     * @param  name the character name
            *
            * @return the code point value of the character specified by its name.
            *
            * @throws IllegalArgumentException if the specified {@code name}
      -     *         is not a valid Unicode character name.
      +     *         is not a valid character name.
            * @throws NullPointerException if {@code name} is {@code null}
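
The compatibility concern is easiest to see as a round trip: a name produced by getName(int) should be accepted by codePointOf(String). The sketch below checks that property; `roundTrips` and `lookup` are hypothetical helpers, and the last call in `main` merely probes how a Unicode-style ideograph name is treated, without asserting a result.

```java
import java.util.OptionalInt;

class CodePointOfDemo {
    // Hypothetical helper: empty when codePointOf rejects the name.
    static OptionalInt lookup(String name) {
        try {
            return OptionalInt.of(Character.codePointOf(name));
        } catch (IllegalArgumentException e) {
            return OptionalInt.empty();
        }
    }

    // True when the name generated by getName maps back to the same code point.
    static boolean roundTrips(int codePoint) {
        String name = Character.getName(codePoint);
        return name != null && Character.codePointOf(name) == codePoint;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips('A'));     // name taken from UnicodeData
        System.out.println(roundTrips(0x4E00));  // name generated by the JDK scheme
        // Matching is case insensitive and ignores surrounding whitespace.
        System.out.println(lookup("  latin capital letter a "));
        // Probe only: a Unicode-style derived name for the same ideograph.
        System.out.println(lookup("CJK UNIFIED IDEOGRAPH-4E00"));
    }
}
```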


              People

              Assignee:
              Naoto Sato
              Reporter:
              Webbug Group
              Reviewed By:
              Brian Burkhalter, Iris Clark, Lance Andersen