JDK-6957230

CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be 3

    Details

    • Type: Bug
    • Status: Closed
    • Priority: P4
    • Resolution: Fixed
    • Affects Version/s: 7
    • Fix Version/s: 7
    • Component/s: core-libs
    • Labels:
    • Subcomponent:
    • Resolved In Build:
      b121
    • CPU:
      x86
    • OS:
      linux
    • Verification:
      Verified

      Description

      A DESCRIPTION OF THE REQUEST :
      Short summary: CharsetEncoder.maxBytesPerChar() returns 4.0 for UTF-8, but the *real* maximum is 3.0. A code point can produce a 4-byte UTF-8 sequence, but every such code point requires *two UTF-16 chars* (a surrogate pair), so its bytes-per-char ratio is only 2.
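
      To make the ratio concrete, here is a minimal sketch that computes bytes per char for three sample code points. The class name Utf8BytesPerChar and the helper bytesPerChar are illustrative names, not part of the JDK, and the sketch assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BytesPerChar {
    // Hypothetical helper: UTF-8 bytes divided by UTF-16 chars for one code point.
    static double bytesPerChar(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return utf8.length / (double) s.length();
    }

    public static void main(String[] args) {
        // U+0041 'A': 1 byte, 1 char
        System.out.println(bytesPerChar(0x41));     // prints 1.0
        // U+20AC euro sign: 3 bytes, 1 char (the worst case per char)
        System.out.println(bytesPerChar(0x20AC));   // prints 3.0
        // U+1F600: 4 bytes, but 2 chars (a surrogate pair)
        System.out.println(bytesPerChar(0x1F600));  // prints 2.0
    }
}
```

      The 4-byte case never exceeds 2 bytes per char, so the 3-byte BMP case sets the maximum.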


      JUSTIFICATION :
      This is a performance issue, not a correctness issue: the code path for String.getBytes("UTF-8") allocates a *worst-case* sized buffer computed from this value. Reducing the value from 4.0 to 3.0 makes that buffer smaller, which will reduce garbage collection rates for string processing applications.
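
      The effect on buffer sizing can be sketched as follows. This is an illustration of the arithmetic only, not the JDK's actual allocation code; the class name WorstCaseBuffer and the helper worstCaseBytes are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class WorstCaseBuffer {
    // Hypothetical helper: worst-case output size for charCount UTF-16 chars,
    // given an upper bound on bytes produced per char.
    static int worstCaseBytes(int charCount, float maxBytesPerChar) {
        return (int) Math.ceil(charCount * (double) maxBytesPerChar);
    }

    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        int chars = 1000;

        // Bound reported by the running JDK (4.0 before the fix, 3.0 after).
        System.out.println("reported: " + worstCaseBytes(chars, encoder.maxBytesPerChar()));

        // The two bounds discussed in this report, for comparison.
        System.out.println("with 4.0: " + worstCaseBytes(chars, 4.0f));
        System.out.println("with 3.0: " + worstCaseBytes(chars, 3.0f));
    }
}
```

      For a 1000-char string the bound drops from 4000 to 3000 bytes, a quarter less temporary memory per call under this sizing assumption.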


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Charset.forName("UTF-8").newEncoder().maxBytesPerChar() should return 3.0

      See the example code for a program that computes and verifies this value.
      ACTUAL -
      Charset.forName("UTF-8").newEncoder().maxBytesPerChar() returns 4.0

      ---------- BEGIN SOURCE ----------
      import java.nio.charset.Charset;

      public class Test {
          public static void main(String[] arguments)
                  throws java.io.UnsupportedEncodingException {
              System.out.println("Reported max bytes per char: " +
                      Charset.forName("UTF-8").newEncoder().maxBytesPerChar());

              // Encode every code point and track the largest bytes/char ratio.
              double maxBytesPerChar = -1;
              for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
                  // One char for BMP code points, two (a surrogate pair) otherwise.
                  String s = new String(Character.toChars(i));
                  // Only checked when run with assertions enabled (java -ea).
                  assert 0 < s.length() && s.length() <= 2;
                  byte[] utf8 = s.getBytes("UTF-8");

                  double bytesPerChar = utf8.length / (double) s.length();
                  if (bytesPerChar > maxBytesPerChar) {
                      maxBytesPerChar = bytesPerChar;
                  }
              }

              System.out.println("Computed real max bytes per char: " +
                      maxBytesPerChar);
          }
      }

      ---------- END SOURCE ----------

              People

              Assignee:
              sherman Xueming Shen
              Reporter:
              webbuggrp Webbug Group