JDK-6642328

Speeding up Single Byte Encoders



    • Type: Bug
    • Status: Closed
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: 7
    • Fix Version/s: 7
    • Component/s: core-libs
    • Verification: Not verified


      Encoding is a tougher problem than decoding for single-byte coders:
      decoding is easily implemented with a mapping table of size 256, but for
      encoders it is generally unreasonable to use a table of size 64K.
      Nevertheless, for most coders it is possible to write table-access code
      that uses a small amount of storage (under 4 KB) and needs only one read
      from this table for all (or at least most) characters to encode.
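
      For contrast, the decode direction needs nothing more than a single
      256-entry lookup; a minimal sketch (the names here are illustrative,
      not the JDK's):

          // Decoding a single-byte charset: one 256-char table, indexed directly
          // by the unsigned byte value -- one read, no branches.
          static final char[] byteToChar = new char[256];   // filled per charset

          static char decode(byte b) {
              return byteToChar[b & 0xFF];
          }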

      If the range of the mapping is small (e.g. ISO-8859-2) then we can
      throw away the trailing unmapped part of the table, and compute an index
      as follows:

      int dl = dp + Math.min(src.remaining(), dst.remaining());   // dst bound assumed
      while (dp < dl) {
          char c = sa[sp];
          // i == c when c < charToByteLen, otherwise i == 0 (branch-free bounds clamp)
          int i = c & ((c - charToByteLen) >> 31);
          byte b = charToByte[i];
          if (b != 0 || c == 0) {
              sp++;
              da[dp++] = b;
          } else {
              return encodeArrayError(src, sp, dst, dp);
          }
      }

      For ISO-8859-2 this gives about a factor of 3 speedup for encoding long sequences.

      If the characters fall into two widely separated ranges, 0 - hi1 and
      lo2 - hi2, we can use a table containing the concatenation of the two
      ranges and compute the index thus:

      int i = (c & ((c - hi1) >> 31)) + (c & ((c - lo2) ^ (c - hi2)) >> 31);

      This should win because computation is becoming free, relative to memory access.
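
      A minimal sketch of such a two-range lookup, assuming the second range is
      stored in the table immediately after the first (the names hi1, lo2 and
      hi2 follow the text; the exact offset arithmetic is illustrative, not the
      report's final code):

          // Branch-free index into a table holding [0, hi1) followed by [lo2, hi2).
          // inR1 is -1 when c < hi1; inR2 is -1 when lo2 <= c < hi2; otherwise 0.
          static int twoRangeIndex(char c, int hi1, int lo2, int hi2) {
              int inR1 = (c - hi1) >> 31;
              int inR2 = ((c - lo2) ^ (c - hi2)) >> 31;
              return (c & inR1) + ((c - lo2 + hi1) & inR2);
          }

      Characters in neither range index entry 0, which must hold the unmappable
      marker, mirroring the single-range check above.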

      This technique can be extended to more ranges as well, depending on the coder.
      For many coders, special values or ranges can be ignored on the first try.
      Only if we get zero (the value normally reserved for unmappable
      characters) do we check the rare ranges.
      This is the way the pesky Euro character should be handled.
      We can make the Euro 5 times as slow to make all other characters twice as fast.
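
      A sketch of that slow path, assuming (for illustration) an ISO-8859-15-style
      mapping of the Euro sign to byte 0xA4; encodeChar is a hypothetical helper,
      not the report's code:

          // Fast path: one table read. Only a zero result (normally "unmappable")
          // falls through to the rare-character checks, so the Euro pays a few
          // extra compares while every common character keeps the single read.
          static int encodeChar(char c) {
              int i = c & ((c - charToByteLen) >> 31);
              byte b = charToByte[i];
              if (b != 0 || c == 0)
                  return b & 0xFF;
              if (c == '\u20AC')          // EURO SIGN, kept out of the fast table
                  return 0xA4;            // its byte in ISO-8859-15 (assumption)
              return -1;                  // caller reports an unmappable character
          }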

      The mapping table should be computed at runtime from the Decoder's table
      and stored in a byte array, not the wasteful char array that is current
      practice. Here's the one for the easy case, ISO-8859-2 (a single range
      starting at 0):

              private static final byte[] charToByte;
              private static final int charToByteLen;
              static {
                  char[] byteToChar = Decoder.byteToCharTable.toCharArray();
                  char maxChar = 0;
                  for (char c : byteToChar)
                      if (maxChar < c)
                          maxChar = c;
                  charToByte = new byte[maxChar + 1];
                  // index i in the decoder's table corresponds to byte value i - 0x80
                  for (int i = 0; i < byteToChar.length; i++)
                      charToByte[byteToChar[i]] = (byte) (i - 0x80);
                  charToByteLen = charToByte.length;
              }
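
      For a coder whose characters fall into two ranges, the same runtime
      construction can concatenate the two slices; a sketch under the
      hi1/lo2/hi2 assumptions used earlier:

          // Chars below hi1 map at their own index; chars in [lo2, hi2) map at
          // (c - lo2 + hi1); everything else stays 0 (the unmappable marker).
          static byte[] buildTwoRangeTable(char[] byteToChar, int hi1, int lo2, int hi2) {
              byte[] table = new byte[hi1 + (hi2 - lo2)];
              for (int i = 0; i < byteToChar.length; i++) {
                  char c = byteToChar[i];
                  byte b = (byte) (i - 0x80);   // same byte-value recovery as above
                  if (c < hi1)
                      table[c] = b;
                  else if (c >= lo2 && c < hi2)
                      table[c - lo2 + hi1] = b;
              }
              return table;
          }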




            Assignee: Xueming Shen (sherman)
            Reporter: Martin Buchholz (martin)