JDK-6636317

Optimize UTF-8 coder for ASCII input

    Details

    • Type: Bug
    • Status: Closed
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: 7
    • Fix Version/s: 7
    • Component/s: core-libs
    • Resolved In Build: b35
    • CPU: generic
    • OS: generic
    • Verification: Not verified

      Description

      The UTF-8 coder can get a dramatic speedup by having a special method
      that handles only ASCII and delegates to a general-purpose method
      if the input contains non-ASCII.

      Here's the kind of method I'm thinking of:

      private CoderResult decodeArrayLoop(ByteBuffer src, CharBuffer dst) {
          byte[] sa = src.array();
          int sp = src.arrayOffset() + src.position();
          int sl = src.arrayOffset() + src.limit();

          char[] da = dst.array();
          int dp = dst.arrayOffset() + dst.position();
          int dl = dst.arrayOffset() + dst.limit();

          CoderResult result = null;

          // Tight ASCII-only loop: one load, one sign check, one store.
          for (;;) {
              if (sp >= sl) {
                  result = CoderResult.UNDERFLOW;
                  break;
              }
              int b = sa[sp];
              if (b < 0)          // high bit set: non-ASCII, bail out
                  break;
              if (dp >= dl) {
                  result = CoderResult.OVERFLOW;
                  break;
              }
              da[dp++] = (char) b;
              sp++;
          }
          src.position(sp - src.arrayOffset());
          dst.position(dp - dst.arrayOffset());
          return result != null ? result : decodeArrayLoop1(src, dst);
      }
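      Note that the buffer positions are written back before the hand-off, so
      the general-purpose method (decodeArrayLoop1 above) resumes exactly at
      the first non-ASCII byte and no input is scanned twice.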

      The non-ASCII decoder case can be sped up as well, by dispatching on
      the leading byte directly instead of going through the big switch; a
      sketch follows.
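      Here is a minimal sketch (mine, not the actual changeset) of what that
      branchy dispatch might look like for the 1-, 2-, and 3-byte cases.
      Bounds and malformed-input checks are omitted, and sa/sp/da/dp are as
      in decodeArrayLoop above; the shift comparisons rely on the leading
      byte being sign-extended:

          int b1 = sa[sp];
          if (b1 >= 0) {                    // 0xxxxxxx: ASCII
              da[dp++] = (char) b1;
              sp++;
          } else if ((b1 >> 5) == -2) {     // 110xxxxx: 2-byte sequence
              int b2 = sa[sp + 1];
              da[dp++] = (char) (((b1 & 0x1f) << 6) | (b2 & 0x3f));
              sp += 2;
          } else if ((b1 >> 4) == -2) {     // 1110xxxx: 3-byte sequence
              int b2 = sa[sp + 1], b3 = sa[sp + 2];
              da[dp++] = (char) (((b1 & 0x0f) << 12) |
                                 ((b2 & 0x3f) << 6)  |
                                  (b3 & 0x3f));
              sp += 3;
          }
          // ... 4-byte case and error handling elided

      Further minor improvements: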

      ---

      We can get rid of the code below, since our implementation always
      guarantees the invariant. Users cannot create their own buggy
      ByteBuffer or CharBuffer implementations, and even if they could,
      our code would be allowed to assume they are non-buggy.

      // assert (sp <= sl);
      // sp = (sp <= sl ? sp : sl);

      ---

      In the ASCII case, the &-ing with 0x7f is useless: the byte has already
      been checked to be non-negative, so the 0x80 bit is guaranteed to be off.

      // da[dp++] = (char)(b1 & 0x7f);
      da[dp++] = (char) b1;

      ---

      More deviously, we can snatch a few cycles in the 2-byte case
      as follows:

          da[dp++] = (char) (((b1 << 6) ^ b2) ^ 0x0f80);
          // da[dp++] = (char) (((b1 & 0x1f) << 6) |
          //                    ((b2 & 0x3f) << 0));
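      Why this works: with sign-extended bytes, b1 << 6 puts the five payload
      bits in the right place plus a fixed pattern of prefix and sign bits;
      XOR-ing in b2 overlays its six payload bits, and the constant 0x0f80
      cancels the leftover fixed bits, which are identical for every
      well-formed 2-byte sequence. A standalone check (my sketch, not part of
      the patch) that exhaustively compares the two forms:

          public class Utf8XorCheck {
              public static void main(String[] args) {
                  for (int i = 0xC0; i <= 0xDF; i++) {      // leading 110xxxxx
                      for (int j = 0x80; j <= 0xBF; j++) {  // trailing 10yyyyyy
                          byte b1 = (byte) i, b2 = (byte) j;
                          char tricky = (char) (((b1 << 6) ^ b2) ^ 0x0f80);
                          char plain  = (char) (((b1 & 0x1f) << 6) | (b2 & 0x3f));
                          if (tricky != plain)
                              throw new AssertionError(i + " " + j);
                      }
                  }
                  System.out.println("XOR form agrees for all 2-byte sequences");
              }
          }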


      ---

      This is only significant for smaller coding operations, but we should
      instantiate a Surrogate.Generator or Surrogate.Parser only in the
      unlikely (in the real world) event of surrogates appearing in the
      input stream:

          if (sgg == null)
              sgg = new Surrogate.Generator();
          int gn = sgg.generate(uc, n, da, dp, dl);
          ....
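      The companion change would be to declare the generator as a lazily
      initialized field, e.g. (hypothetical, matching the snippet above):

          // Created on first use, so the common surrogate-free path
          // never pays for the allocation.
          private Surrogate.Generator sgg = null;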

      ---
      The comparison below is vacuously true, since c is of type char and a
      char value can never exceed '\uFFFF' (Character.MAX_VALUE).

      if (c <= '\uFFFF') {

      ---


            People

            • Assignee: sherman Xueming Shen
            • Reporter: martin Martin Buchholz
            • Votes: 0
            • Watchers: 0
